ABSTRACT


We investigate certain fundamental problems in decentralized decision making and
computation.  We  study  the  problem  of  whether  a  set  of  decision  makers  (or 
processors)  with  different  (but  related)  information  may  make  compatible  decisions 
without  communication  and  we  characterize  the  computational  complexity  of  this 
problem.  We  also  analyze  the  complexity  of  other  basic  problems  of  decentralized 
decision making, such as decentralized detection (hypothesis testing).

We  then  consider  a  scheme  whereby  a  set  of  decision  makers  (processors)  exchange 
and  update  tentative  decisions  which  minimize  a  common  cost  function,  given  the 
information  they  possess;  we  show  that  they  are  guaranteed  to  converge  to 
consensus.

Finally,  we  consider  a  broad  class  of  asynchronous  distributed  (deterministic  and 
stochastic)  iterative  optimization  algorithms  tolerating  communication  delays. 

We  associate  communication  requirements  with  such  algorithms  and  show  that  they 
converge  appropriately  under  certain  conditions  which  are  no  more  severe  than 
those  required  for  their  centralized  counterparts. 

Several  applications  in  human  organizations,  parallel  computation  and  distributed 
signal processing are indicated.
CHAPTER 1: INTRODUCTION


1.1  PROBLEM  DEFINITION 

The  subject  of  this  report  is  the  investigation  of  certain  central  problems 
in  decentralized  (distributed)*  decision  making  and  computation. 

Classical  (centralized)  decision  making  theory  deals  with  the  situation  in 
which  a  single  decision  maker  (DM)  (man  or  machine)  possesses  all  available  knowledge 
and  information  related  to  a  certain  system  and  has  to  make  a  decision  (or  a 
sequence  of  decisions)  so  as  to  achieve  a  certain  objective  (minimize  a  cost  criterion 
for  example) . 

Many real world systems (power systems, public or business organizations,
large  manufacturing  systems,  etc.),  however,  are  too  large  for  the  classical  model 
of  decision  making  to  be  applicable.  There  may  be  a  multitude  of  decision  makers  (or 
processors),  none  of  which  possesses  all  relevant  knowledge.  In  addition,  there  may 
be  limitations  on  the  amount  of  communications  allowed  between  distinct  decision 
makers,  so  that  it  is  impractical  to  exchange  all  available  information  and  convert 
the  problem  to  a  centralized  one.  In  fact,  even  if  there  exists  infinite  capacity 
for  communications,  still  centralization  may  be  inadvisable,  because  no  decision  maker 
may  have  the  capability  of  tackling  the  overall  problem  by  himself.  The  above  reasons 
make decentralization necessary, whereby several decision makers make their own
decisions, based on partial information, possibly by solving a problem related to the
original  one.  This  raises  the  need  to  structure  the  decentralized  decision  process 
so  that  the  outcome  of  the  joint  effort  of  the  various  decision  makers  achieves,  in 
some  sense,  the  goal  of  the  overall  system  (organization). 

From the point of view of the theory of computation, "centralized" theory deals
with the case in which a single serial processor is to execute a sequence of
instructions, in order to evaluate a desired result (value). In decentralized
computation*,  the  same  goal  is  achieved  by  a  set  of  processors  operating  in  parallel 
and  exchanging  partial  results.  Parallel  computation  is  advantageous  in  many 
situations,  because  the  desired  final  result  may  be  evaluated  much  faster,  or  because 
the  input  data  of  the  computation  are  physically  distributed  (for  example,  if  the 
computation  consists  of  statistical  processing  of  data  acquired  by  physically  distinct 
sensors). Interesting problems arise in this context because good parallel algorithms
can  be  very  different  from  simple  adaptations  of  good  serial  algorithms. 

Concerning decision making problems, especially in human organizations, it is
often the case that distinct decision makers have different objectives, which are
also different from the objective of the organization. This may lead to conflict and
to situations best addressed by game theory. We restrict attention, however, to
situations in which:

a)  There  is  a  well-defined  organizational  objective. 

b)  The  individual  decision  makers  are  either  physical  processors  (so  that  no 
interest  may  be  ascribed  to  them)  or  they  may  be  treated  -  for  the  purpose  of 
analysis  -  as  if  they  were  processors  with  predictable  behavior.  For  example, 
a  human  decision  maker  may  make  decisions  motivated  by  his  perceived  self- 
interest;  however,  if  an  analyst  knows  the  perceived  self-interest  of  that 
decision  maker,  he  may  be  able  to  predict  his  behavior.  From  that  point  on, 
self-interests become irrelevant: as far as analysis is concerned, the human
decision  maker  may  be  modelled  as  a  processor,  operating  in  a  specific  way. 


*In this context, the terms "parallel" or "distributed" computation are often used.
Given such a setting, this report aims at a better understanding of the
limitations and capabilities of decentralized systems. Clearly, any issues
pertinent to centralized decision problems may arise in decentralized problems as well.
Our goal is, however, to focus on issues which are unique to decentralized systems,
the predominant ones being the effects of communications, or of the impossibility
of communications.

Our  work  does  not  center  around  any  particular  application.  The  selection  of 
the  problems  to  be  addressed  is  guided,  however,  by  applications  in: 

a)  Decentralized  (parallel)  computation. 

b)  Human  Organizations. 

c)  Decentralized  Signal  Processing. 

In some respects, there is little in common between such diverse applications.
On the other hand, they all involve decentralization. If an abstract study may find
applications  in  more  than  one  of  the  above  areas,  it  is  legitimate  to  claim  that  a 
common  denominator  behind  a  variety  of  decentralized  systems has  been  addressed.  This 
is  not  an  elusive  goal:  for  example,  Arrow  and  Hurwicz  [1960]  have  indicated  common 
features  of  decentralized  computation  and  the  operation  of  economic  markets. 

We  close  this  section  with  a  remark  on  terminology.  Depending  on  the  context, 
the  "entities"  in  a  decentralized  system  which  perform  computations  or  make  decisions 
are  often  called  processors,  decision  makers,  modules  or  agents.  Each  of  the  above 
terms  often  carries  certain  connotations,  which  we  wish  to  avoid.  In  particular, 
using  the  term  "decision  maker"  often  implies  the  existence  of  an  individual  interest 
on  the  basis  of  which  decisions  are  being  made.  If,  however,  we  have  a  human  who 
simply does exactly what he has been told to do (as if he were programmed), the term
"agent" seems to be more appropriate. The latter term, however, sounds unnatural
in  a  situation  involving  only  computing  machines.  For  these  reasons,  we  prefer  to 
use the neutral terms "processor" or "module", with the understanding that, on some
occasions, they may refer to humans who process information and apply decisions in a
prescribed  manner. 

1.2  OUTLINE  AND  OVERVIEW 

In  this  section  we  outline  the  contents  of  this  report,  with  the  main  goal  of 
highlighting  the  unity  and  continuity  of  the  chapters  that  follow. 

Chapter  2  addresses  certain  conceptual  issues  related  to  decentralized  systems. 
Starting  with  a  general  definition  of  decentralized  systems,  we  indicate  some  typical 
features that may be present. We then discuss the nature of associated design
problems, pointing to the possibility of having a decentralized system design another
decentralized  system. 

We  then  continue  with  an  abstract  discussion  and  a  schematic  history  of  actual 
organizations,  which  serves  as  a  motivation  for  some  of  the  problems  to  be  studied 
in  the  sequel,  together  with  a  brief  discussion  of  the  types  of  mathematical  problems 
raised by the possibility of communications, motivating again the work that follows.
Chapter 2 ends with a survey of some of the literature related to decentralized
systems.

Chapter 3 deals with the simplest but fundamental problems of decentralized
decision  making,  exploring  the  effect  of  communication  constraints. 

The first problem we raise is the following: given a set of processors with
partial information, is it possible that they make satisfactory decisions (in a
certain sense), without communicating? The next problem, which follows logically
from the preceding one, is: if it turns out that they have to communicate, what
is the least amount of communications required? We pose these problems in a simple
setting in which the sets of possible observations and decisions are finite. We
show, however, that these are hard combinatorial problems. We also investigate a
variety of special cases and versions of the basic problems, with the goal of
determining the boundary between easy and hard problems. Special attention is paid to
the well-known problem of decentralized hypothesis testing (detection).
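
To make the first question concrete, the following is a minimal sketch under one possible formalization, which is an assumption and not necessarily the exact setting of Chapter 3: two processors observe y1 and y2 from finite sets, each must pick a decision from a finite set using only its own observation, and a hypothetical predicate `satisfactory(y1, y2)` returns the set of acceptable decision pairs. The function name `find_rules` and the two toy cases are illustrative only. The search enumerates every pair of decision rules, so its running time grows exponentially with the number of observations.

```python
from itertools import product

def find_rules(Y1, Y2, U1, U2, feasible, satisfactory):
    """Search for rules g1: Y1 -> U1 and g2: Y2 -> U2 that need no communication.

    feasible     : iterable of observation pairs (y1, y2) that can occur
    satisfactory : function (y1, y2) -> set of acceptable decision pairs (u1, u2)
    Returns (g1, g2) as dicts, or None if no such pair of rules exists.
    """
    for choice1 in product(U1, repeat=len(Y1)):        # |U1| ** |Y1| candidates
        g1 = dict(zip(Y1, choice1))
        for choice2 in product(U2, repeat=len(Y2)):    # |U2| ** |Y2| candidates
            g2 = dict(zip(Y2, choice2))
            if all((g1[y1], g2[y2]) in satisfactory(y1, y2)
                   for (y1, y2) in feasible):
                return g1, g2
    return None

Y = [0, 1]
U = [0, 1]
all_pairs = [(a, b) for a in Y for b in Y]

# Case 1: the two decisions only have to agree with each other.
agree = find_rules(Y, Y, U, U, all_pairs, lambda y1, y2: {(0, 0), (1, 1)})
# Case 2: processor 1 would have to name processor 2's observation.
guess = find_rules(Y, Y, U, U, all_pairs, lambda y1, y2: {(y2, 0), (y2, 1)})
```

In the first case constant rules suffice; in the second no pair of rules exists, so some communication would be needed. Exhaustive search of this kind says nothing by itself about complexity; it only shows why the size of the rule space is the obstacle.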

The  results  of  Chapter  3  lead  to  the  conclusion  that  the  optimal  design  of  a 
decentralized system, even in the absence of any dynamics, is very hard computationally.
The problem "who should communicate to whom, what, when, etc." cannot be addressed,
for all practical purposes, with the goal of optimizing within a most general class
of communication protocols. Rather, we have to restrict ourselves to smaller, more
tractable classes of communication protocols. The choice of a particular class of
protocols inescapably contains a certain degree of arbitrariness.

With this point of view, we consider, in Chapter 4, a particular protocol
concerning a set of processors with identical prior information (Bayesian models),
different posterior (on-line) observations and identical cost functions. These
processors are assumed to exchange (asynchronously) tentative decisions (that is, decisions
which are optimal given the information they possess). We show that the decisions
of the processors will asymptotically converge to consensus even if they have
limited memory and forget some of their past information, in the course of the process
of exchanging tentative decisions. We show that this scheme leads to a decomposition
algorithm for static linear estimation problems. We also discuss briefly what could
happen if the decision makers had different cost functions and/or models (prior
probabilities).
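
As a rough illustration of exchanging tentative decisions, here is a toy sketch that is far narrower than the scheme of Chapter 4 and rests on assumptions not made in the text: a scalar unknown x with a zero-mean Gaussian prior, two processors observing x plus independent Gaussian noise, known precisions `p0`, `p1`, `p2`, and a squared-error cost, so that each tentative decision is a conditional mean. All function and parameter names are hypothetical.

```python
def posterior_mean(prior_prec, obs, precs):
    """E[x | obs] for x ~ N(0, 1/prior_prec) and obs[k] = x + noise_k,
    with independent noise_k ~ N(0, 1/precs[k])."""
    return sum(p * y for p, y in zip(precs, obs)) / (prior_prec + sum(precs))

def exchange_tentative_decisions(y1, y2, p0=1.0, p1=1.0, p2=1.0):
    # Processor 1 announces its tentative decision, optimal given y1 alone.
    d1 = posterior_mean(p0, [y1], [p1])
    # Processor 2 knows the common model, so it can invert the linear map
    # y1 -> d1 and then decide optimally given both observations.
    y1_inferred = d1 * (p0 + p1) / p1
    d2 = posterior_mean(p0, [y1_inferred, y2], [p1, p2])
    # Processor 1 recovers y2 from d2 and its own y1, then updates its decision.
    y2_inferred = (d2 * (p0 + p1 + p2) - p1 * y1) / p2
    d1_updated = posterior_mean(p0, [y1, y2_inferred], [p1, p2])
    return d1, d2, d1_updated

d1, d2, d1_updated = exchange_tentative_decisions(2.0, -1.0)
```

In this toy the two processors agree after one round trip, and the common decision equals the centralized estimate based on both observations. The general results of Chapter 4 concern asymptotic consensus under weaker assumptions, which this sketch does not attempt to reproduce.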


The scheme of Chapter 4 will often be hard to implement, because at each
stage  each  processor  has  to  evaluate  an  optimal  (tentative)  decision,  given  its 
information,  which  may  be  a  hard  non-linear  problem.  For  this  reason,  we  take  in 
Chapter  5  a  more  realistic  approach,  more  applicable  to  human  decision  makers  with 
bounded  rationality  (cognitive  limitations)  or  computing  machines  with  limited 
capabilities:  each  processor  makes  tentative  decisions  which  are  communicated  to 
other processors; concurrently with the process of exchanging messages, each
processor updates its tentative decision. In contrast with Chapter 4, however, these
updates  are  not  optimal  in  a  Bayesian  sense;  rather,  they  are  small  updates  in  a 
direction  which  is  expected  to  improve  performance. 

The scheme of Chapter 5 may be viewed from several different perspectives: as
a decentralized algorithm for solving an optimization problem, in which case
tentative decisions are local states of computation; alternatively, as a process of
adjustment of human decision makers in a divisionalized organization. Of course, the
different perspectives concern the interpretation of the results, not the mathematical
analysis.

We  now  outline  the  contents  of  Chapter  5,  in  more  technical  terms.  We  first 
develop  a  suitable  model  of  decentralized  computation  and  then  proceed  to  prove 
convergence  results  for  a  broad  class  of  asynchronous  decentralized  (deterministic 
and  stochastic)  optimization  algorithms,  tolerating  communication  delays.  These 
include constant step-size algorithms (e.g. gradient-type deterministic optimization
algorithms), as well as decreasing step-size algorithms (e.g. stochastic approximation
algorithms).
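
A constant step-size algorithm of this kind can be sketched as follows. This is a simplified, deterministic toy and not the model of Chapter 5: it assumes a quadratic cost 0.5 x'Ax - b'x, one processor per component, and communication at fixed intervals (`comm_period`), so that each processor computes its partial derivative from possibly outdated copies of the other components. The names `async_gradient`, `view`, `step` and `comm_period` are illustrative assumptions.

```python
def async_gradient(A, b, step=0.1, comm_period=3, iters=2000):
    """Gradient iteration on f(x) = 0.5 x'Ax - b'x with stale information.

    Processor i owns x[i]. view[i][j] is processor i's copy of x[j], which is
    refreshed only every comm_period iterations.
    """
    n = len(b)
    x = [0.0] * n
    view = [[0.0] * n for _ in range(n)]
    for t in range(iters):
        new_x = list(x)
        for i in range(n):
            view[i][i] = x[i]                      # own component is always current
            grad_i = sum(A[i][j] * view[i][j] for j in range(n)) - b[i]
            new_x[i] = x[i] - step * grad_i        # small step in a descent direction
        x = new_x
        if (t + 1) % comm_period == 0:             # periodic exchange of values
            for i in range(n):
                for j in range(n):
                    view[i][j] = x[j]
    return x

A = [[2.0, 0.5], [0.5, 1.0]]
b = [1.0, 1.0]
x_delayed = async_gradient(A, b, comm_period=3)
x_fresh = async_gradient(A, b, comm_period=1)
```

For this weakly coupled example both runs approach the solution of Ax = b, namely (2/7, 6/7). The sketch only suggests why the frequency of communication relative to the coupling between components matters; it does not capture random delays or stochastic updates.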

The conditions for convergence typically contain requirements on the frequency
of communications between processors. These conditions may be interpreted as
guidelines for designing the communication flows in a decentralized system (be it
a human organization or a parallel computer) in a way that guarantees smooth
operation of the system. We also apply our results to algorithms for decentralized
identification of dynamical systems.

The last chapter contains an overview of our study, some conclusions and
suggestions for future research directions.

1.3  CONTRIBUTIONS  OF  THIS  REPORT 

In this Section we list the contributions of Chapters 3, 4 and 5, which contain
our technical results.

Chapter 3:

° We show that the discrete versions of the basic (static) problems of decentralized
decision making (including the well-known team-decision problem [Marschak and Radner,
1972]) are algorithmically hard (NP-complete, or worse), even though the corresponding
centralized problem is trivial. We also obtain complexity results for several special
cases.

°  We  show  that  the  problem  of  decentralized  hypothesis  testing  [Tenney  and  Sandell, 
1981]  is  algorithmically  hard,  if  a  certain  simplifying  assumption  of  Tenney  and 
Sandell  is  removed. 

° We show that the problem of designing an optimal communications protocol is
algorithmically hard, even in the simplest setting.


Chapter  4: 

° We show that a set of processors (with identical models and cost function, but
different on-line information) who exchange optimal (given their information)
decisions will asymptotically converge to consensus, under certain assumptions.
This is true even if the processors have limited memory, so that they may forget
some of the information they had acquired during the operation of this scheme.
These results significantly generalize earlier results of Aumann [1976],
Geanakoplos and Polemarchakis [1978], Borkar and Varaiya [1982].

° We obtain a new decomposition algorithm for solving linear estimation problems
and prove that it converges exponentially.


Chapter 5:

° We develop a model for asynchronous decentralized computation which generalizes an
earlier model of Bertsekas [1982,1983]. This model allows different processors
either to specialize in updating a component assigned to them, or to "overlap",
so that many of them update the same component of a decision vector. A novel feature
of this model is that it allows us to associate with the decentralized algorithm an
aggregate state of computation.

° We prove convergence (to the centralized optimum) of asynchronous decentralized
versions of a broad class of deterministic and stochastic optimization algorithms
(with either constant or decreasing step-size) under conditions which are not
significantly stronger than those required for the centralized counterparts of our
results [Poljak and Tsypkin, 1973].
° We prove convergence results for decentralized stochastic algorithms driven by
correlated  noise,  analyzed  via  the  ODE  approach  [Ljung,  1977a]. 

°  We  study  gradient-type  deterministic  optimization  algorithms  for  an  additive  cost 
function  in  which  a  different  processor  is  assigned  to  a  different  term  of  the  cost 
function.  We  show  that  convergence  is  guaranteed  if  the  frequency  of  communications 
between  two  processors  is  proportional  to  the  degree  of  coupling  between  their 
subproblems.

°  We  apply  our  results  and  prove  convergence  of  certain  decentralized  algorithms  for 
identification  of  dynamical  systems. 

°  Leaving  technical  details  aside,  a  main  contribution  of  Chapter  5  is  that  a  novel 
way has been found to associate communication requirements with the smooth operation
of a class of decentralized systems.


CHAPTER 2: A DISCUSSION OF DECENTRALIZED SYSTEMS

In  this  Chapter  we  discuss  certain  general  issues  concerning  decentralized 
systems.  Our  discussion  has  to  be  limited,  however,  due  to  the  fact  that  there 
is a great variety of real world decentralized systems. For example, a
decentralized controller for a large power system, a large company or a special purpose
multiprocessor have enough differences so as to render a unified analysis impossible.
On  the  other  hand,  very  different  systems  may  have  substantial  similarities,  so  that 
an  abstract  viewpoint  may  be  useful.  (See  [Arrow  and  Hurwicz,  1960],  for  example.) 

In  Section  2.1  we  start  with  a  definition  of  a  decentralized  system  (in  order 
to  fix  the  terminology)  and  discuss  some  qualitative  features  that  may  be  present. 

We  comment  on  the  nature  of  the  problem  of  designing  a  decentralized  system  and 
suggest  that  the  design  may  be  carried  out  by  another  decentralized  system  which  is 
distinct  from  the  first. 

Section 2.2 takes a brief and schematic look at existing organizations, so as to
identify  some  types  of  problems  that  may  be  addressed. 

Section  2.3  focuses  on  the  role  of  communications,  since  these  are  the  cause 
of  the  main  differences  between  centralized  and  decentralized  systems.  This  Section, 
together  with  the  preceding  one,  serves  as  an  abstract  motivation  for  the  Chapters 
that  follow. 

Section  2.4  surveys  some  of  the  literature  related  to  decentralized  systems. 

2.1  DECENTRALIZED  SYSTEMS  AND  THEIR  DESIGN 


language for the discussion that follows. For this reason, we avoid the discussion
of continuous-time decentralized systems, which may lead to non-trivial well-posedness
questions that need not concern us here.

From  an  abstract  mathematical  perspective,  a  decentralized  system  is  simply 
an  interconnected  system  in  which  we  give  certain  particular  interpretations  to  the 
interconnection  variables  involved.  Formally,  we  have  (see  Figure  2.1.1): 


1. An environment module $M_0$ and a set $V = \{M_1, \ldots, M_N\}$ of control modules.

2. A directed graph $G = (V, E)$ whose nodes are the control modules. The arcs in
this graph indicate the allowed directions of communications.

3. For any $(i,j) \in E$, let $M_{ij}$ be a channel module, through which messages from
$M_i$ to $M_j$ are being transmitted.

4. Let $x_i(t)$, $x_{ij}(t)$ denote the state of module $M_i$, $M_{ij}$, respectively, at
time $t$ ($t$ belongs to a discrete index set $T$), assumed to lie in some set
(state space) $X_i$, $X_{ij}$, respectively.

5. Let $u_i(t)$ denote the control (decision) applied by module $M_i$, $i \neq 0$, on the
environment, at time $t$, assumed to lie in some set $U_i$.

6. Let $y_i(t)$ denote the observation (measurement) on the environment, obtained at
time $t$ by module $M_i$, $i \neq 0$, assumed to lie in some set $Y_i$.

7. Let $w_i$, $w_{ij}$ be disturbances which affect the operation of modules $M_i$, $M_{ij}$,
respectively, belonging to sets $W_i$, $W_{ij}$, respectively. Here $w_0$ captures the
uncertainty in the environment; $w_i$, $i \neq 0$, captures the uncertainty in the inner
operation of control module $M_i$; $w_{ij}$ captures the uncertainty in the communication
process (message distortions and delays). (Following the conventions of
probability theory, the disturbances are not indexed by time; so the sets
$W_i$, $W_{ij}$ are most likely to be sets of sample paths. Of course, no
probabilistic assumptions are made at this point.)

8. Let $m^s_{ij}(t)$, $m^r_{ij}(t)$, $(i,j) \in E$, denote the messages sent from $M_i$ to $M_j$ at
time $t$, and received by $M_j$ from $M_i$ at time $t$, respectively. These are assumed to
belong to sets $\mathcal{M}_{ij}$.

9. Finally, we have certain relations between the above introduced variables
which specify the evolution of the internal states of each module as well as
the interactions between modules:

$$x_0(t+1) = \Phi_0\bigl(t, x_0(t), u_1(t), \ldots, u_N(t), w_0\bigr) \tag{2.1.1}$$

$$x_i(t+1) = \Phi_i\bigl(t, x_i(t), u_i(t), y_i(t), m^r_{1i}(t), \ldots, m^r_{Ni}(t), w_i\bigr), \quad i \neq 0, \tag{2.1.2}$$

$$x_{ij}(t+1) = \Phi_{ij}\bigl(t, x_{ij}(t), m^s_{ij}(t), w_{ij}\bigr), \tag{2.1.3}$$

$$u_i(t) = \psi_{i0}\bigl(t, x_i(t), w_i\bigr), \quad i \neq 0, \tag{2.1.4}$$

$$y_i(t) = \psi_{0i}\bigl(t, x_0(t), u_i(t), w_0\bigr), \quad i \neq 0, \tag{2.1.5}$$

$$m^s_{ij}(t) = \psi_{ij}\bigl(t, x_i(t), y_i(t), u_i(t), w_i\bigr), \quad i, j \neq 0, \tag{2.1.6}$$

$$m^r_{ij}(t) = \hat{\psi}_{ij}\bigl(t, x_{ij}(t), m^s_{ij}(t), w_{ij}\bigr), \quad i, j \neq 0. \tag{2.1.7}$$

Equations (2.1.1)-(2.1.7) implicitly assume the following sequence of events
at time $t$:

(i) Controls are applied.

(ii) Observations are obtained.

(iii) Messages are transmitted.

(Of course, the above assumed sequence of events is a matter of convention.)
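
The structure of (2.1.1)-(2.1.7), including the assumed order of events, can be sketched as a small simulation loop. This is only an illustration under simplifying assumptions: disturbances are omitted, the evolution and output maps are passed in as a dictionary of callables with hypothetical key names, and the toy instance (two modules, noise-free observations, channels acting as one-step delays) is invented for the example.

```python
def simulate(maps, x0, x, xch, edges, horizon):
    """Iterate a deterministic instance of (2.1.1)-(2.1.7).

    x0 : environment state; x : {i: state of M_i}; xch : {(i, j): state of M_ij}
    maps : callables standing in for the Phi and psi maps (disturbances omitted)
    """
    history = []
    for t in range(horizon):
        # (i) controls are applied, as in (2.1.4)
        u = {i: maps["control"](i, t, x[i]) for i in x}
        # (ii) observations are obtained, as in (2.1.5)
        y = {i: maps["observe"](i, t, x0, u[i]) for i in x}
        # (iii) messages are sent (2.1.6) and received (2.1.7)
        sent = {e: maps["send"](e, t, x[e[0]], y[e[0]], u[e[0]]) for e in edges}
        recv = {e: maps["receive"](e, t, xch[e], sent[e]) for e in edges}
        history.append({"x0": x0, "u": u, "y": y, "recv": recv})
        # state updates (2.1.1)-(2.1.3), all computed from time-t quantities
        inbox = {i: {k: recv[(k, j)] for (k, j) in edges if j == i} for i in x}
        x0_next = maps["env"](t, x0, u)
        x_next = {i: maps["module"](i, t, x[i], u[i], y[i], inbox[i]) for i in x}
        xch_next = {e: maps["channel"](e, t, xch[e], sent[e]) for e in edges}
        x0, x, xch = x0_next, x_next, xch_next
    return x0, x, xch, history


# Toy instance (hypothetical): two modules act on a scalar environment state.
toy_maps = {
    "env": lambda t, x0, u: x0 + sum(u.values()),
    "control": lambda i, t, xi: -0.25 * xi,
    "observe": lambda i, t, x0, ui: x0,              # noise-free measurement
    "send": lambda e, t, xi, yi, ui: yi,             # forward own observation
    "receive": lambda e, t, xe, m_sent: xe,          # channel acts as one-step delay
    "channel": lambda e, t, xe, m_sent: m_sent,
    "module": lambda i, t, xi, ui, yi, inbox:
        (yi + sum(inbox.values())) / (1 + len(inbox)),
}
edges = [(1, 2), (2, 1)]
final_x0, final_x, final_xch, history = simulate(
    toy_maps, x0=8.0, x={1: 0.0, 2: 0.0}, xch={(1, 2): 0.0, (2, 1): 0.0},
    edges=edges, horizon=2)
```

In this instance a message sent at time 0 is received at time 1, because the channel state stores the last message sent. Replacing the channel maps with identity pass-through would model the case of no communication delay.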


General  Features  of  Decentralized  Systems 

We  now  discuss  briefly  some  features  which  may  (but  need  not)  be  present  in 
a  decentralized  system. 

1.  The  environment  may  be  modelled  as  an  interconnected  system,  or  it  may  be 
completely absent. If it is an interconnected system, its topology may coincide
with or differ from the topology of the interconnections of the control modules.
The  case  of  coinciding  topologies  creates  some  additional  structure,  which  may  be 
mathematically  exploited,  and  has  been  emphasized  in  the  literature;  however,  such 
a  coincidence  is  not  a  logical  necessity. 

2. The controls applied by the control modules may influence the evolution of the
state  of  the  environment.  They  may  also  influence  the  observations  directly,  via 
equation  (2.1.5).  This  allows  us  to  capture  actions  of  a  special  type  which  initiate 
a  measurement  process,  without  affecting  the  environment.  In  other  words,  the  action 
of  a  control  module  may  determine  whether  a  measurement  will  be  obtained  or  not,  and 
of  what  quality. 

3. Disturbances $w_i$ on control modules may reflect processor failures, rounding-off
errors  or  bounded  rationality,  when  the  modules  represent  humans. 

4. If at time $t$ no observation is made by $M_i$, we may let $y_i(t)$ be equal to a special
symbol indicating this absence. So, it is not required that an observation be made
at each time instant. The same convention may be applied concerning message
transmissions and receptions.


5. Observations obtained or messages received by module $M_i$ may be remembered or
forgotten. (Such effects can be captured by (2.1.2).)

6. The channel modules are often fairly simple. For example, a channel may consist
of a time delay together with some uncertain distortion (channel noise).

7. Equations (2.1.4)-(2.1.7) have been written so that - if there are no
communication delays and no channel noise - any observation obtained by module $M_i$ may
become immediately available to any other module $M_j$. In that case, all modules
may make the next decision on the basis of common information.

8. A message $m^r_{ij}(t)$ may influence $x_j(t+1)$, which in turn may influence $m^s_{ji}(t+1)$.
This allows us to model a situation in which a message contains a request for
certain pieces of information.

9. A control module may have certain special state variables which remain
constant (equal to their initial values) but nevertheless influence the evolution of
the remaining state variables. Such special variables may be viewed as
parametrizations, "set-points" of a controller or, for modules representing humans, they may
capture the capabilities, beliefs or prior knowledge of human decision makers.

10. The local disturbances $w_i$ to control module $M_i$ may influence the time at which
certain messages are transmitted or certain actions are undertaken. This allows us
to model systems which operate without any strict synchronization.

11.  The  evolution  functions  of  either  the  environment  or  the  control  modules  may 
be  dynamic  (next  state  depends  on  current  one)  or  not. 


Classification  of  Decentralized  Systems 


We  saw  above  that  a  variety  of  qualitative  phenomena  may  arise  in  a  decentral¬ 
ized  system.  If  all  of  them  are  concurrently  present,  no  meaningful  analysis  seems 
possible.  For  this  reason,  it  may  be  useful  to  classify  decentralized  systems  in 
a  qualitative  way.  A  general  classification  is  possible  along  the  following  lines, 
starting  from  a  higher  and  proceeding  to  a  lower  level: 

(i) According to the topology of the modules.

(ii)  According  to  the  presence  or  absence  of  certain  variables. 

(iii)  According  to  which  variables  explicitly  influence  other  variables. 

(iv)  According  to  the  nature  of  these  influences. 

The  Performance  of  a  System 

We assume that a system exists in order to accomplish some task, perform some
function or control the environment in a specific way; that is, there exists a
well-defined organizational objective. Otherwise, no design-oriented analysis
would be meaningful. This does not exclude the possibility that certain control
modules have their own interests, or that they have incorrect models of the other
modules. Nevertheless, no matter what each module "believes" to be true, the
system's analyst has to postulate the existence of a true, objective model which
describes the evolution of the system. Moreover, this evolution is to be compared
to a "desired" one, to see whether certain performance criteria are met. What may
be desirable from the analyst's perspective may be quite distinct from what is
desired by the modules.


There  is  an  abundance  of  qualitatively  different  ways  of  measuring  the 
performance  of  a  system.  For  example: 

1. The performance criterion may depend on the temporal evolution of the state
variables of the environment and/or control modules. In particular, it may depend
on the entire history of the state variables (e.g. control theory with additive
costs), it may be asymptotic (e.g. the stability requirement in adaptive control)
or it may depend only on a final state (e.g. the performance of a parallel algorithm
may be judged in terms of the termination time).

2. A cost may be incorporated which is related to the size of the state spaces
of the control modules (so as to reflect the cost of memory) or to the complexity
of the evolution equations of the control modules (to reflect computation costs
during implementation).

3. There may be costs depending on the values of the controls being applied (e.g.
a penalty on energy being used), or costs of obtaining measurements.

4. Finally, there may be costs associated with the process of communications. For
example, each message transmitted could result in a penalty, possibly depending on
the identities of the transmitter and the receiver, as well as on the length of the
message.

Note that the above mentioned aspects 1, 2 and 3 of the cost criterion are also
relevant  to  systems  with  a  single  control  module.  So,  it  is  the  fourth  aspect  which 
should  be  expected  to  lead  to  qualitatively  new  issues. 
Typically, not all of the design issues are included in the performance
criterion.  Rather,  a  design  is  performed  in  terms  of  a  narrow  criterion  and,  then, 
the  designer  has  to  check  whether  certain  side-concerns  are  adequately  handled. 

Some  common  side-issues  are  related  to  robustness,  sensitivity  and  adaptivity  to 
small  and/or  structural  unmodelled  variations. 

On  Design  Problems 

We  start  with  the  premise  that  we  are  not  just  interested  in  analyzing  the 
performance  of  a  given,  fixed  decentralized  system;  rather,  we  want  to  design  one. 

We  must  therefore  distinguish  those  elements  of  the  system  which  are  given  from 
those  which  are  amenable  to  design. 

Formally, a system $S$ may be viewed as an ordered pair $(\alpha, \beta)$, where $\alpha$, $\beta$
represent the fixed and variable elements of the system, respectively. We assume
that, for any $\alpha$, we are given a set $B(\alpha)$ of admissible choices of the manipulable
parts of the system. It is often helpful to think about many systems simultaneously,
which motivates the next definition:

Definition 2.1.1: A decentralized scheme is a collection $\{(\alpha, \beta(\alpha)) : \alpha \in A\}$ of
decentralized systems, where $\beta$ is some function on $A$ with $\beta(\alpha) \in B(\alpha)$ for every $\alpha \in A$.

Essentially, $\beta$ yields an admissible design for any value of the fixed elements
of the system. For example, if $A$ is the set of linear systems, let the map $\beta$ assign
a stabilizing compensator to each $\alpha \in A$, according to a well-defined rule.

Note  that,  typically,  the  separation  of  a  system  into  a  and  3_parameters  follows 
a  certain  pattern: 

•(i)  The  environment  dynamics,  the  effects  of  the  controls,  the  environment  distur¬ 
bances  and  the  properties  of  the  channels  are  usually  fixed. 


(ii)  The  controller  dynamics  and  the  mechanisms  generating  messages  and  controls 
are  usually  free  to  be  designed. 

Let us now discuss design problems. Suppose that we are given a set A of
parameters; for each $\alpha \in A$, a set $B(\alpha) \subset B$ of admissible designs; and a performance
criterion $J : A \times B \to R$. The following questions may be raised:

(i) Given $\alpha$, $\beta$, what is the value of $J(\alpha, \beta)$?

(ii) For $\alpha = \alpha_0$, find $\beta^* \in B(\alpha_0)$ which minimizes $J(\alpha_0, \beta)$ over all $\beta \in B(\alpha_0)$.

(iii) Find $\beta^* : A \to B$ such that, for any fixed $\alpha \in A$, $\beta^*(\alpha)$ solves problem (ii).

Let us now focus on questions (ii) and (iii). They are different in the same way
that the following two problems are different: (a) Minimize $(\beta - 2)^2$; (b) Minimize
$(\beta - \alpha)^2$. Question (ii) corresponds to a case study: designing a specific system;
question (iii) corresponds to designing a scheme. Although applications always
concern specific systems, theoretical analysis, in order to be general and useful,
must focus on questions of type (iii). We are thus led to the next issue: what
constitutes an answer to question (iii)? A closed form representation or a listing
of the values of $\beta^*$ in a table is usually impossible. The only alternative left is
for the answer to consist of a list of instructions that may be followed to construct
$\beta^*(\alpha)$ from $\alpha$; that is, an algorithm which evaluates $\beta^*$. The instructions of any
such algorithm will consist of some elementary instructions (or operations) which
are assumed to be readily implementable, say in the form of a subroutine call.
(What we call elementary instructions here may be at a very high level when compared
to the elementary instructions considered in the theory of computation.) For example,
the dynamic programming algorithm uses the instruction $u := \arg\min_u g(u)$ and
the gradient algorithm for unconstrained optimization uses the instruction
$A := (\partial g/\partial u)(u)$. (So, the instruction assumed to be available for the solution
of one problem may be the problem itself in another setting.) Accordingly, what
constitutes an answer to question (iii) depends strongly on the set of elementary
instructions.
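
The distinction between questions (ii) and (iii) can be illustrated on the toy criterion $J(\alpha, \beta) = (\beta - \alpha)^2$ mentioned above. The following is only a minimal sketch: the function names, the fixed step size and the number of iterations are assumptions made for the illustration, and the derivative routine plays the role of an "elementary instruction" that is taken to be available.

```python
def grad_J(alpha, beta):
    """Elementary instruction (assumed available): dJ/dbeta for
    J(alpha, beta) = (beta - alpha)**2."""
    return 2.0 * (beta - alpha)


def design_scheme(alpha, step=0.25, iterations=60, beta0=0.0):
    """An answer to question (iii): a list of instructions that constructs
    a design beta*(alpha) from any value of the fixed parameter alpha."""
    beta = beta0
    for _ in range(iterations):
        beta -= step * grad_J(alpha, beta)   # plain gradient iteration
    return beta


# Question (ii): a case study for the single parameter value alpha = 2.
beta_case_study = design_scheme(2.0)

# Question (iii): the same instructions answer every admissible alpha.
designs = {alpha: design_scheme(alpha) for alpha in (-1.0, 0.5, 3.0)}
```

For this criterion the closed form $\beta^*(\alpha) = \alpha$ is of course available; the sketch is meant only to show what "an algorithm which evaluates $\beta^*$" looks like when a closed form is not assumed.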

We argued above that a decentralized system (or scheme) is to be designed by an
algorithm, possibly a distributed one. Notice that such algorithms are special cases of
(decentralized) systems, under our definition. We obtain, therefore, a hierarchy:

(i) A system being implemented. (Lower level.)

(ii) A system which yields as final output the system to be implemented.
(Higher level.)

1.  The  two  systems,  at  the  different  levels,  are  distinct  and  should  not  be 
confused,  even  though  they  are  both  related  to  the  same  situation. 

2.  Each  of  the  systems  in  (i)  and  (ii)  above  may  be  either  centralized  or 
decentralized,  thus  leading  to  four  different  combinations. 

3. It makes a large difference whether the lower level is centralized and
the higher decentralized, or the converse is true. (In the context of linear control,
the first case has been called "decomposition" and the second "decentralization"
by Sandell [1976].)

4. If the systems at both levels are decentralized, the two topologies may
coincide, although this is not necessary. (In most of the literature on hierarchical
control [Mahmoud, 1977], the two topologies are assumed to be identical. This may
lead to some conceptual confusion, as it may be unclear which of the two decentralized
systems is being referred to. Findeisen [1982] clarifies this distinction by
talking about the "programming" and "execution" phases.)

5. The lower and higher level systems are interrelated: a) If we change the
design problem for the lower level system, a different high level system (more
complex or less complex) could be used. Typically, one is willing to sacrifice some
performance for the lower level system, if this results in a less complex design
algorithm (higher level system). However, such tradeoffs are often extremely hard
to quantify. b) Once the higher level system has terminated its operation, its output
must be transmitted to certain locations (the control modules of the lower level
system). These transmissions may entail a certain cost which is not pertinent to
the higher or lower level system, but rather to the fact that the second must start
from where the first has stopped. c) It is also conceivable that the two systems
referred to above operate concurrently. For example, an organization may change its
structure while operating.

2.2  ON  REAL  WORLD  ORGANIZATIONS 

There are many types of organizations but all of them are decentralized systems
in our terminology. We choose to restrict attention to those organizations in which
there are certain people who have a direct interest in the performance of the
organization and also have substantial authority for making changes. In our
terminology, they are faced with the problem of designing a decentralized system.
Because this problem is often intractable, very few organizations have been
structured according to a grand design which started from scratch. Rather, the design
process was itself a decentralized system, possibly operating concurrently with the
organization itself. Such a design is often incremental in two ways:

(i)  Changes  in  the  mode  of  operation  of  each  module  are  incremental. 

(ii) Modules are added to the organization incrementally. (An organization
typically grows from smaller to larger.)

Due to this incrementalism, the current shape of an organization can only be
understood in terms of its past history. (Organizational design problems may have
many local optima; the final design will therefore depend on the starting point and
the steps that were followed.) The following abstract history of evolution may be
hypothesized [Galbraith, 1977]:

1. In the beginning, we have a small organization in which each module
operates in a fixed manner, which may be assumed to be close to optimal. Many
actions are undertaken without communicating but, whenever a module needs some
information from other modules, it may obtain it by communicating.

2. As the organization grows, it ceases being optimal and some modules start
changing their mode of operation so as to improve performance. As their operation
changes, they need to inform (and do inform) those modules which should be concerned
about such changes.

3. For an even larger organization, it becomes hard for each module to know
who is concerned about what. So, higher levels are introduced which do not know
details about the lower level modules, but have a global picture and know who should
be concerned about what. They act as coordinators by setting up the necessary
information flows. Also, in case the interests of the lower level modules
diverge from the interests of the organization, the higher level designs
a reward scheme so as to bring these interests as much in line as possible.

The above procedure need not lead to an optimal design. However, since the
global problem is very hard, the above incremental procedure has to be accepted.
What remains to be done, from a normative perspective, is to ensure that each step
of the above procedure is carried out in a rational way, as well as possible. The
above outline may, therefore, serve as a general guide on what kinds of problems
should be addressed.

2.3  ON  COMMUNICATIONS  AND  ASSOCIATED  PROBLEMS 

Many  of  the  aspects  of  decentralized  systems  are  also  present  in  centralized 
ones;  therefore,  they  are  not  the  prime  subject  of  a  theory  of  decentralized  systems. 
The main aspect which is unique to decentralized systems is the distribution of
information and communications, and these should be the focus of theoretical
investigations.

Communications are unimportant if they are instantaneous, unconstrained, not
penalized and not corrupted by noise: in such a case, an optimal centralized design
coincides with an optimal decentralized design. Communications become important only
if one of the above assumptions is violated. Formally, communications add just
another component to the cost function and some new constraints. Difficulties arise,
however, because typically these new costs and constraints are qualitatively different
from the costs and constraints associated with the operation of the rest of the system.
This may make decentralized versions of otherwise easy problems particularly hard
(see Chapter 3).

For these reasons we need some understanding of basic decentralized problems
in which communications enter in a simple way. Such simple problems may yield some
insight for more complex ones, or may be used as building blocks. From the
mathematical point of view, it is important to discover new ways of incorporating
communications costs and constraints into decentralized versions of centralized problems
in a sufficiently simple fashion so that analysis does not become impossible.
We may then proceed to address fundamental questions such as: "who should communicate
what to whom and when, so as to guarantee that a decentralized system performs as desired?"


2.4  LITERATURE  SURVEY 

There are many disciplines relevant to the analysis or design of decentralized
systems: for example, decision theory, game theory, control theory, mathematical
programming, organization theory and computer science. The associated bibliography
runs into the thousands and a comprehensive survey is impossible. In this Section, we
discuss some representative lines of research, focusing on those areas which bear
some relation to our work. More specific references may also be found in the main
body of Chapters 3, 4 and 5.

Bayesian  Team  Decision  Theory  and  Decentralized  Control 

Team decision theory, in its original (static) version [Radner, 1962; Marschak
and Radner, 1972], deals with the following problem: a set of processors obtain
measurements of a stochastic environment; then, each processor makes a decision,
according to a decision rule, based on its own set of measurements only, i.e. without
communicating. Then a cost is incurred, depending on the decisions of each processor
and the state of the environment. The problem consists of designing the decision
rule of each processor, so as to minimize the expected cost. A solution may be
easily obtained for certain special cases in which we may restrict to a set of
decision rules admitting simple parametrizations: for example, linear quadratic
Gaussian (LQG) problems [Marschak and Radner, 1972] or linear Gaussian problems with
an exponential cost criterion [Krainak, Speyer and Marcus, 1982a, 1982b; Speyer,
Marcus and Krainak, 1980]. In both of the above cases, optimal decision rules turn
out to be affine functions of the measurements of each processor. Very little,
however, can be said for the general version of the static team problem. (See
Section 3.4 for some results on the complexity of this problem.) An interesting result
has been obtained by Arrow and Radner [1979], who have shown that the law of large
numbers may be exploited to yield a simple approximate solution for a class of team
problems involving a very large number of processors. Also, Witsenhausen [1981] has
derived a lower bound for a class of team problems.
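
To make the static team problem concrete, the following sketch solves a tiny discrete instance by exhaustive search over pairs of decision rules. Everything in it is a hypothetical illustration rather than a construction taken from the cited works: two processors, binary measurements and decisions, a uniform joint distribution and a quadratic cost chosen only so that the answer is easy to check by hand.

```python
from itertools import product

OBSERVATIONS = (0, 1)
DECISIONS = (0, 1)

# Hypothetical joint distribution of the two measurements (y1, y2).
PROB = {(y1, y2): 0.25 for y1 in OBSERVATIONS for y2 in OBSERVATIONS}


def cost(y1, y2, u1, u2):
    """Hypothetical team cost: the two decisions should add up to y1 + y2."""
    return (u1 + u2 - (y1 + y2)) ** 2


def expected_cost(rule1, rule2):
    """Expected cost when processor i applies rule_i to its own measurement only."""
    return sum(p * cost(y1, y2, rule1[y1], rule2[y2])
               for (y1, y2), p in PROB.items())


def solve_static_team():
    """Exhaustive search over all pairs of decision rules."""
    # A rule is a tuple (decision if y = 0, decision if y = 1).
    rules = list(product(DECISIONS, repeat=len(OBSERVATIONS)))
    return min(((expected_cost(r1, r2), (r1, r2))
                for r1 in rules for r2 in rules),
               key=lambda pair: pair[0])


best_cost, best_rules = solve_static_team()
```

The search space contains one rule per function from measurements to decisions for each processor, so it grows exponentially with the number of possible measurements; this brute-force approach is therefore only usable on toy instances, which is consistent with the remark that little can be said about the general problem.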

The static problem considered thus far admits a multi-stage (or continuous time)
extension [Ho and Chu, 1972; Singh, 1981]: the dynamic team or optimal decentralized
control problem. Most studies have focused on the LQG case [Ho and Chu, 1972; Chu,
1972; Ho and Chang, 1980; Ho, 1980], since the general case is much harder, even
for static problems. The main results are in a sense negative: a solution may be
easily found only in the presence of "partially nested information structures", that
is, when a condition of the following type is satisfied: if the decision of processor
A at time t influences the measurements of processor B at time T (T > t), then the
information of B at time T contains the information of processor A at time t. In
the absence of such a condition, Witsenhausen [1968] has constructed a simple two-stage
example showing that optimal decision rules may be nonlinear functions of the
information. Even if linearity of decision rules is imposed as a constraint, the
optimal decision rules may correspond to infinite-dimensional compensators [Barta,
1978]. Certain special information structures are easier to study: for example,
the control-sharing information pattern [Sandell and Athans, 1974] and the one-step
delay information pattern [Sandell and Athans, 1974; Kurtaran and Sivan, 1974;
Bagchi and Basar, 1980; Varaiya and Walrand, 1978; Krainak et al., 1982c], for which
simple solutions can be found.

The difficulties in the dynamic team problem may be understood in different
ways. We first have the "second guessing" effect: each processor makes decisions
by guessing the decisions of other processors, hence guessing their guesses and so on,
ad infinitum, so that infinite dimensional compensators are obtained for finite
dimensional systems. Second, there is the possibility of using the system being
controlled as a channel for communicating information, by judicious choice of the
control variables (signalling) [Sandell and Athans, 1974; Ho, Kastner and Wong, 1978].
From a more mathematical viewpoint, the root of the difficulty lies in the fact that
the distribution of information and the prohibition of communications correspond to
a special type of constraint on admissible decision rules, which is hard to handle
[Ho and Chang, 1980].

A way out of these difficulties has been taken by restricting the admissible
compensators (decision rules) to have a fixed structure [Chong and Athans, 1971; Looze
et al., 1978]. For the infinite horizon problem, we obtain a nonlinear parametric
optimization problem which can be solved numerically [Looze et al., 1978; Geromel and
Bernussou, 1979]; however, it is a non-convex problem that may possess many local minima.

A higher level problem, which has been little studied, consists of designing
the information structure, subject to certain constraints [Chu, 1978; Ho and
Papadopoulos, 1979; Papavassilopoulos, 1983].

Once the pursuit of optimality is abandoned, we may pose the problem of designing
a decentralized compensator which stabilizes a given system. Several fairly
complete results have been obtained for this problem [Wang and Davison, 1973; Saeks,
1979; Anderson and Clements, 1981; Davison and Ozguner, 1983; Sezer and Siljak, 1981],
some of them surprising [Anderson and Moore, 1981]. Finally, if the model of the
system to be controlled is itself unknown, one may consider the possibility of
decentralized adaptive control, on which some preliminary results have been reported
[Ioannou and Kokotovic, 1983].

More references in this general area can be found in the survey papers of
Sandell et al. [1978] and Siljak [1983].

Decentralized  Estimation  and  Hypothesis  Testing 

Suppose  that  a  set  of  sensors  obtains  measurements  (as  in  the  team  problem)  and 
each  one  of  them  tries  to  estimate  some  random  vector  or  perform  a  hypothesis  test. 
Suppose  also  that  the  cost  criterion  couples  the  estimates  (or  decisions)  of  different 
sensors  by  penalizing,  for  example,  positive  correlation  of  their  errors.  The  problem 
of  designing  the  estimators  (decision  rules)  of  each  sensor  is  then  a  special  case  of 
the  team  problem;  the  main  difference  is  that,  now,  the  estimates  (decisions)  of  one 
sensor  do  not  affect  the  measurements  of  other  sensors,  thus  avoiding  the  possibility 
of signalling. Barta [1978] has shown that the best estimates for linear Gaussian
dynamical systems can be obtained by finite-dimensional filters (resembling the
Kalman filter), although of a much larger dimension.


A two-sensor decentralized hypothesis testing problem has been studied by
Tenney and Sandell [1981] under the assumption that the measurements of the two
sensors are statistically independent (conditioned on either hypothesis). It was
shown that, similarly to the centralized case, likelihood ratios provide sufficient
statistics; the optimal thresholds for the two sensors are coupled through nonlinear
algebraic equations. Most interestingly, it was shown that asymmetric thresholds
may be optimal for perfectly symmetric problems, reflecting hedging behavior.
Qualitatively similar results have also been obtained for the decentralized quickest
detection problem [Teneketzis, 1982], the decentralized sequential hypothesis testing
(Wald) problem [Teneketzis, 1983], as well as problems involving communication of
zero-one messages from certain sensors to others [Ekchian and Tenney, 1982]. However,
almost all available results depend heavily on the conditional independence assumption.
Section 3.3 shows that if this assumption is removed, the problem becomes very hard.

The research described above allows either no communications at all, or the
communication of a few zero-one messages. Consequently, the final estimates are, in
general, worse than the estimates that would be obtained if the processors were to
exchange their detailed information. At the other extreme, we have schemes in which
sufficient communications are allowed, so that optimal centralized estimates are
obtained. Several architectures for decentralized Kalman filtering (or smoothing)
have been suggested [Speyer, 1979; Hassan et al., 1978; Chang, 1980; Chong, 1979;
Willsky et al., 1982; Levy et al., 1983]. These results typically boil down to the
following: the centralized optimal Kalman filter estimate is a linear function of the
measurements of the different sensors and the proposed schemes are different ways
of applying the superposition principle.
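
The superposition idea can be seen in a static, scalar analogue of the filtering problem. The sketch below is not any of the cited architectures; it assumes a zero-mean Gaussian unknown x with a known prior variance and two sensors with measurements y_i = h_i x + v_i, where the noises v_i are independent, zero-mean Gaussian with variance r_i. All names and numbers are hypothetical. Each sensor compresses its measurement into two numbers, the fusion step only adds these up, and the result is compared with the estimate computed directly from the joint covariance of the raw measurements.

```python
def local_summary(h, r, y):
    """Computed by one sensor from its own measurement y = h*x + v, Var(v) = r."""
    return h * y / r, h * h / r


def fused_estimate(summaries, prior_var):
    """The fusion step only adds up the local summaries (superposition)."""
    weighted_sum = sum(s[0] for s in summaries)
    information = 1.0 / prior_var + sum(s[1] for s in summaries)
    return weighted_sum / information


def centralized_estimate(prior_var, sensor1, sensor2):
    """Reference: E[x | y1, y2] = Cov(x, y) Cov(y)^{-1} y, using a 2x2 inverse."""
    (h1, r1, y1), (h2, r2, y2) = sensor1, sensor2
    s11 = prior_var * h1 * h1 + r1          # Var(y1)
    s12 = prior_var * h1 * h2               # Cov(y1, y2)
    s22 = prior_var * h2 * h2 + r2          # Var(y2)
    det = s11 * s22 - s12 * s12
    c1, c2 = prior_var * h1, prior_var * h2  # Cov(x, y1), Cov(x, y2)
    g1 = (c1 * s22 - c2 * s12) / det
    g2 = (c2 * s11 - c1 * s12) / det
    return g1 * y1 + g2 * y2


sensors = [(1.0, 1.0, 2.0), (2.0, 4.0, 3.0)]   # hypothetical (h, r, y) triples
x_fused = fused_estimate([local_summary(*s) for s in sensors], prior_var=4.0)
x_central = centralized_estimate(4.0, *sensors)
```

Under these assumptions the two estimates coincide, so each sensor needs to transmit only its two-number summary rather than its raw data.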


Finally, we have some intermediate schemes in which real numbers are being
communicated, but without necessarily attaining the centralized optimal performance.
The scheme of Borkar and Varaiya [1982] is an example. (These results are
significantly extended in Chapter 4.) In the scheme of Sanders et al. [1974, 1978] each
sensor produces estimates of only some of the components of an unknown state vector,
corresponding to a particular subsystem. Each sensor takes interactions from other
subsystems into account using either noisy observations of these interactions, or
messages (possibly corrupted by noise) from other sensors. The loss of optimality
is compensated by the fact that simpler filters, requiring fewer computations, may
be used.

Decentralized  (Parallel)  Computation 

In decentralized computation, several processors do a sequence of computations
(in parallel) and exchange partial results until they obtain a desired final result.
The main advantage is that parallelism may reduce the time required for obtaining
the final result (even though the total number of operations involved may be larger).

Arrow and Hurwicz [1960] have shown that decentralized computation is closely
related to decentralized decision making. Their inspiration came from the observation
that market mechanisms (which can be viewed as decentralized systems) do, under
certain assumptions, minimize a social cost function. Moreover, they have indicated
the importance of the compatibility of the distribution of computation with the
distribution of information. The analogy between optimization algorithms and models
of rational decision making has been carried further, especially in the context of
resource allocation in divisionalized organizations [Moore, 1979; Burton and
Obel, 1980]. Section 5.7 of this report also proceeds along these lines.

There has been a significant volume of research on decentralized algorithms
for several classical problems, in which each processor only needs to know a subset
of the input data. Such algorithms may be broadly classified into synchronous and
asynchronous ones.

Synchronous algorithms may be viewed as alternative implementations of
centralized (serial) algorithms. In general, however, it is not true that a decentralized
implementation of a good serial algorithm is also a good decentralized algorithm.
This leads to new and interesting theoretical questions concerning mainly the
trade-offs between the number of processors used, the total computation time required and
the number of messages that have to be exchanged. Some representative decentralized
synchronous algorithms have been developed for traditional numerical analysis problems
[Miranker, 1971], for optimal routing in data communication networks [Gallager, 1977],
as well as for many combinatorial problems [Borodin and Hopcroft, 1982; Gallager, 1983;
Ullman, 1984]. Trade-offs have been extensively studied by computer scientists,
because they have consequences on the practical feasibility of solving certain types
of problems by VLSI multiprocessors [Yao, 1981].

From a more theoretical perspective, the amount of communications required to
solve a discrete (combinatorial) problem by a decentralized algorithm is another
complexity measure, similar to the "time" and "space" complexity measures for serial
computation. Certain general results have been obtained by Papadimitriou and Sipser
[1982]. (See also [Aho, Ullman and Yannakakis, 1983].) We will see, however, in Section
3.5 that it is very hard to determine the minimum number of communications required
for solving a given problem.


In many applications, asynchronous decentralized algorithms are desirable,
because of weaker requirements on the timing of computations and communications
[Kung, 1976]. Such algorithms raise theoretical questions as to how much
asynchronism may be tolerated, while maintaining correctness (convergence) of the
algorithm. Bertsekas [1982, 1983] has obtained general results for the successive
approximations algorithm for dynamic programming and the computation of fixed points.
Earlier results may be found in [Baudet, 1978; Chazan and Miranker, 1969]. Chapter 5
contains several results on asynchronous decentralized optimization algorithms.
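
As an illustration of what tolerating asynchronism means, the following sketch simulates a fixed point iteration x = Ax + b in which each processor updates one component using possibly outdated values of the others. It is a toy simulation under stated assumptions, not one of the algorithms of the cited papers: the data A and b are hypothetical, the mapping is a contraction in the maximum norm (each row of A has absolute sum 0.5), delays are bounded by max_delay, and processors take turns in a fixed order.

```python
import random

A = [[0.0, 0.3, 0.2],
     [0.1, 0.0, 0.4],
     [0.25, 0.25, 0.0]]   # hypothetical data; each row sums to 0.5
b = [1.0, 2.0, 3.0]


def component(i, x):
    """i-th component of the mapping T(x) = A x + b."""
    return sum(A[i][j] * x[j] for j in range(len(x))) + b[i]


def synchronous_fixed_point(sweeps=200):
    """Reference: every component is updated simultaneously from current values."""
    x = [0.0] * len(b)
    for _ in range(sweeps):
        x = [component(i, x) for i in range(len(b))]
    return x


def asynchronous_fixed_point(steps=600, max_delay=5, seed=0):
    """One component is updated per step, from values that may be out of date."""
    rng = random.Random(seed)
    n = len(b)
    history = [[0.0] * n]            # history[t] is the state after t updates
    for t in range(steps):
        i = t % n                    # processor i updates its own component
        stale = []
        for j in range(n):
            delay = 0 if j == i else rng.randint(0, max_delay)
            stale.append(history[max(0, len(history) - 1 - delay)][j])
        new_state = list(history[-1])
        new_state[i] = component(i, stale)
        history.append(new_state)
    return history[-1]


x_sync = synchronous_fixed_point()
x_async = asynchronous_fixed_point()
```

In this example both iterations approach the same fixed point; the asynchronous one simply needs more elementary updates, since stale information slows down the contraction.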


Hierarchical  Decision  Making  and  Control 

Mesarovic, Macko and Takahara [1970] have suggested a general and formal approach
to hierarchical decision making. The environment is assumed to be an interconnected
system and a processor is assigned to each subsystem. Each such processor solves a
relatively small optimization problem derived from a global optimization problem; a
higher level processor (coordinator) affects, through some parameter, the structure
of the lower level problems. The main task of the coordinator is to find a value for
the parameter so that the decisions evaluated by the lower level processors are
optimal for the global problem. The original theory was very abstract
[Varaiya, 1972] but it was followed by much research [Mahmoud, 1977] with particular
emphasis on infinite dimensional (e.g. optimal control) problems. It seems that this
theory is mostly applicable to the steady-state control of dynamic systems. For
stochastic and dynamic optimization problems, the hierarchical decision making schemes
that have been proposed are, in general, not optimal, either because they incorporate
some sort of open-loop feedback or because they restrict the set of admissible control
laws in an ad-hoc fashion [Chong and Athans, 1976; Forestier and Varaiya, 1978].
This is not necessarily undesirable, because real-world decentralized systems are
not supposed to achieve the centralized optimum, but only to approximate it.
However, relatively little progress has been made in systematically obtaining
approximately optimal decision rules which are easier to find than the truly optimal
ones.
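
The role of the coordinator can be sketched on a small deterministic example. Price coordination is used here as one classical way of choosing the coordinating parameter; it is swapped in for illustration and is not claimed to be the specific method of the works cited above. The global problem (hypothetical) is to minimize the sum of (x_i - a_i)^2 subject to the x_i adding up to a given total; the quadratic costs, the names and the step size are all assumptions.

```python
def lower_level_decision(target, price):
    """Processor i minimizes (x - target)**2 + price * x over x (closed form)."""
    return target - price / 2.0


def coordinate(targets, total, step=0.5, iterations=100):
    """Coordinator: adjust the price until the local decisions add up to total."""
    price = 0.0
    for _ in range(iterations):
        decisions = [lower_level_decision(a, price) for a in targets]
        # Raise the price if the processors jointly ask for too much,
        # lower it if they ask for too little.
        price += step * (sum(decisions) - total)
    return price, [lower_level_decision(a, price) for a in targets]


price, decisions = coordinate(targets=[1.0, 2.0, 6.0], total=6.0)
```

Each lower level processor needs to know only its own target and the current price, while the coordinator needs only the sum of the decisions. For this quadratic example the price iteration converges when the step is positive and smaller than 4 divided by the number of processors.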

Spatial  and  Time-Scale  Separation 

Many large scale systems consist of subsystems which are weakly coupled or
which evolve in different time scales. Such a structure may be exploited for
designing decentralized controllers based on the theory of singular or non-singular
perturbations.

A non-singularly perturbed system is one composed of weakly coupled subsystems
(spatial separation). An approximate description can be obtained by neglecting the
interactions during the first stage of the analysis [Simon and Ando, 1961; Aoki,
1968]. Singular perturbations correspond to time scale separation, are related to a
relatively rich class of phenomena and need much more sophisticated mathematics to
be analyzed [Papanicolaou, Stroock and Varadhan, 1977]. Many results have been obtained,
with applications in filtering [Haddad, 1976] and control [Teneketzis and Sandell,
1977; Kokotovic, O'Malley and Sannuti, 1976] of linear systems, control of Markov
chains [Delebecque and Quadrat, 1981; Phillips and Kokotovic, 1981] and multimodelling
in large scale systems [Khalil and Kokotovic, 1978].


Game Theory and the Design of Incentives


Game theory aims at characterizing rational behavior in the presence of conflict
[Von Neumann and Morgenstern, 1953; Luce and Raiffa, 1957; Vorobev, 1977; Roth, 1979].
A particular class of games is related to incentive system design. Here the higher
level of an organization chooses a reward scheme and the lower levels make decisions
trying to maximize their individual reward. The problem consists of choosing a
reward scheme so that the lower level decision makers end up minimizing an
organizational cost function. In the language of Section 2.1, the incentive design problem
is essentially a problem of designing a decentralized system when the available control
modules are "selfish" individuals who aim at maximizing their individual rewards,
rather than optimizing the performance of the organization. It should be emphasized
that the nature of results obtained for this problem depends heavily on the set of
admissible reward schemes; the larger this set, the easier it becomes to force the
low-level decision makers to behave in a desired way [Groves, 1973; Groves and Loeb,
1979; Jennergren, 1980; Ho, Luh and Muralidharan, 1981].


CHAPTER  3:  DISCRETE  DECENTRALIZED  DECISION  PROBLEMS 


3.1  INTRODUCTION  AND  MOTIVATION 

In this Chapter we formulate and study certain simple decentralized problems.
Our goal is to formulate problems which reflect the inherent difficulties of
decentralization; that is, any difficulty in this class of problems is distinct from the
difficulty of corresponding centralized problems. This is accomplished by formulating
decentralized problems whose centralized counterparts are either trivial or vacuous.

One of our goals is to determine a boundary between "easy" and "hard"
decentralized problems. Our results will indicate that the set of "easy" problems is
relatively small.

All problems to be studied are imbedded in a discrete framework; the criteria we
use for deciding whether a problem is difficult or not come from complexity theory
[Garey and Johnson, 1979; Papadimitriou and Steiglitz, 1982]: following the tradition
of complexity theory, problems that may be solved by a polynomial algorithm are
considered easy; NP-complete, or worse, problems are considered hard.* However, an
NP-completeness result should not be thought of as a result that closes a subject, but
rather as a result which can redirect research efforts to heuristic and approximate
algorithms or possibly easier special cases. It should also be stressed here that
NP-completeness of a discrete problem can be an indication that the corresponding
continuous problem does not admit an analytical solution or an efficient numerical
solution [Papadimitriou and Tsitsiklis, 1984].

*One way of viewing NP-complete problems is to say that they are effectively equivalent
to the Traveling Salesman problem, which is well-known to be algorithmically hard.

The main issue of interest in decentralized systems may be loosely phrased as
"who should communicate to whom, what, how often," etc. From a purely logical point
of view, the first question that has to be raised is "are there any communications
necessary?" Any further questions deserve to be studied only if we come to the
conclusion that communications are indeed necessary.

The subject of Section 3.2 is to characterize the inherent difficulty of the
problem of deciding whether any communications are necessary, for a given situation.
We adopt the following approach: a decentralized system exists in order to accomplish
a certain goal which is externally specified and well-known. A set of processors
obtain (possibly conflicting) observations on the state of the environment. Each
processor has to make a decision, based on its own observation. However, for each
state of the environment, only certain decisions accomplish the desired goal. The
question "are there any communications necessary?" may then be reformulated as "can
the goal be accomplished, with certainty, without any communications?" We show that
this problem is, in general, a hard one.

We then impose some more structure on the problem, by assuming that the
observations of different processors are related in a particular way. The main issue that
we address is "how much structure is required so that the problem is an easy one?"
and we determine the boundary between easy and hard problems.

In Section 3.3 we study a particular (more structured) decentralized problem,
the problem of decentralized hypothesis testing, which has attracted some interest
recently, and characterize its difficulty.

In  Section  3.4  we  formulate  a  few  problems  which  are  related  to  the  basic 
problem  of  Section  3.2  and  discuss  their  complexity. 

Suppose that it has been found that communications are necessary. The next
question of interest is "what is the least amount of communications needed?" This
problem (Section 3.5) is essentially the problem of designing an optimal communications
protocol; it is again a hard one and we discuss some related issues.

In  Section  3.6  we  present  our  conclusions  and  discuss  the  conceptual  significance 
of  our  results.  These  conclusions  may  be  summarized  by  saying  that: 

a)  Even  the  simplest  problems  of  decentralized  decision  making  are  hard. 

b)  Allowing some redundancy in communications may greatly facilitate the (off-line)
problem of designing a decentralized system.

c)  Practical  communications  protocols  should  not  be  expected  to  be  optimal,  as  far 
as  minimization  of  the  amount  of  communications  is  concerned. 

The results of this Chapter appear in [Papadimitriou and Tsitsiklis, 1982;
Tsitsiklis and Athans, 1985].


3.2  A  PROBLEM  OF  SILENT  COORDINATION 

In this Section we formulate and study the problem of whether a set of processors
with different information may accomplish a given goal, with certainty, without any
communications.

Let $\{1,\ldots,M\}$ be a set of processors. Each processor, say processor $i$, obtains
an observation $y_i$ which comes from a finite set $Y_i$ of possible observations. Then,
processor $i$ makes a decision $u_i$ which belongs to a finite set $U_i$ of possible decisions
according to a rule

$$u_i = \gamma_i(y_i) \tag{3.2.1}$$

where $\gamma_i$ is some function from $Y_i$ into $U_i$. The $M$-tuple $(y_1,\ldots,y_M)$ is the total
information available; so it may be viewed as the "state of the environment."


For each state of the environment, we assume that only certain $M$-tuples
$(u_1,\ldots,u_M)$ of decisions accomplish a given, externally specified, goal. More
precisely, for each $(y_1,\ldots,y_M) \in Y_1 \times \cdots \times Y_M$, we are given a set
$S(y_1,\ldots,y_M) \subseteq U_1 \times \cdots \times U_M$ of satisficing decisions. (So, $S$ may be viewed
as a function from $Y_1 \times Y_2 \times \cdots \times Y_M$ into $2^{U_1 \times \cdots \times U_M}$.)

The  problem  to  be  studied,  which  we  call  the  "distributed  satisficing  problem" 
(after  the  term  introduced  by  H.  Simon  [1980])  may  be  described  formally  as  follows: 


Distributed Satisficing (DS): Given finite sets $Y_1,\ldots,Y_M$, $U_1,\ldots,U_M$ and a function
$S: Y_1 \times \cdots \times Y_M \to 2^{U_1 \times \cdots \times U_M}$, are there functions $\gamma_i: Y_i \to U_i$,
$i=1,2,\ldots,M$, such that

$$(\gamma_1(y_1),\ldots,\gamma_M(y_M)) \in S(y_1,\ldots,y_M), \quad \forall (y_1,\ldots,y_M) \in Y_1 \times \cdots \times Y_M\,? \tag{3.2.2}$$
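The definition of DS can be made concrete with a small exhaustive search. The following is only a sketch under stated assumptions: the function name `solve_ds` and the representation of $S$ as a Python callable are illustrative choices, not part of the report, and the search takes exponential time, so it is usable only on tiny instances.

```python
from itertools import product

def solve_ds(Y, U, S):
    """Exhaustively search for satisficing decision rules (hypothetical helper).

    Y, U: lists of M finite lists (observation sets and decision sets).
    S:    callable mapping an observation tuple (y_1, ..., y_M) to the set
          of decision tuples that are acceptable for that state.
    Returns a list of M dicts (one rule gamma_i per processor), or None.
    """
    M = len(Y)
    # A rule gamma_i is one choice of decision for every observation in Y[i].
    rule_spaces = [product(U[i], repeat=len(Y[i])) for i in range(M)]
    for choice in product(*rule_spaces):
        rules = [dict(zip(Y[i], choice[i])) for i in range(M)]
        # Condition (3.2.2): the decisions must be acceptable in every state.
        if all(tuple(rules[i][y[i]] for i in range(M)) in S(y)
               for y in product(*Y)):
            return rules
    return None

# Two processors, binary observations and decisions.
Y = [[0, 1], [0, 1]]
U = [[0, 1], [0, 1]]
agree = lambda y: {(0, 0), (1, 1)}               # any common decision is fine
xor = lambda y: {(y[0] ^ y[1], y[0] ^ y[1])}     # both must output y_1 XOR y_2
```

With these two illustrative goals, `solve_ds(Y, U, agree)` finds rules, while `solve_ds(Y, U, xor)` returns `None`: no processor can compute the XOR from its own observation alone.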


Remarks:

1.  We  are  assuming  that  the  function  S  is  "easily  computable";  for  example,  it  may 
be  given  in  the  form  of  a  table. 

2.  The centralized counterpart of DS would be to allow the decision $u_i$ of each agent
to depend on the entire set $(y_1,\ldots,y_M)$ of observations; so, $\gamma_i$ would be a function
from $Y_1 \times \cdots \times Y_M$ into $U_i$. (This corresponds to a situation in which all processors
share all information.) Clearly, then, there exist satisfactory (satisficing) functions
$\gamma_i: Y_1 \times \cdots \times Y_M \to U_i$, if and only if $S(y_1,\ldots,y_M) \neq \emptyset$,
$\forall (y_1,\ldots,y_M) \in Y_1 \times \cdots \times Y_M$.
Since $S$ is an "easily computable" set as a function of its arguments, we can see that
the centralized counterpart of DS is a trivial problem. So, any difficulty inherent
in DS is only caused by the fact that information is decentralized.
3.  A "solution" for the problem DS cannot be a closed-form formula which gives an
answer 0 (no) or 1 (yes). Rather, it has to be an algorithm, a sequence of
instructions, which starts with the data of the problem $(Y_1,\ldots,Y_M, U_1,\ldots,U_M, S)$ and
eventually provides the correct answer. Accordingly, the difficulty of the problem
DS may be characterized by determining the place held by DS in the complexity hierarchy.
For definitions related to computational complexity and the methods typically used,
the reader is referred to [Garey and Johnson, 1979; Papadimitriou and Steiglitz, 1982].

4.  If, for some $i$, the set $U_i$ is a singleton, processor $i$ has no choice regarding
its decision and, consequently, the problem is equivalent to a problem in which
processor $i$ is absent. Hence, without loss of generality, we only need to study
instances of DS in which $|U_i| \ge 2$, $\forall i$.

5.  We believe that the problem DS captures the essence of coordinated decision making
with decentralized information and without communications (silent coordination).

Some  initial  results  on  DS  are  given  by  the  following: 

Theorem  3.2.1: 

a)  The problem DS with two processors ($M=2$) and restricted to instances for which the
cardinality of the decision sets is 2 ($|U_i|=2$, $i=1,2$) may be solved in polynomial time.

b)  The problem DS with two processors ($M=2$) is NP-complete, even if we restrict to
instances for which $|U_1|=2$, $|U_2|=3$.

c)  The problem DS with three (or more) processors ($M \ge 3$) is NP-complete, even if we
restrict to instances for which $|U_i|=2$, $\forall i$.

Proof:  We will only prove here part (c); the first two parts are corollaries of
Theorem 3.2.2 which is a stronger result.

Our starting point is the satisfiability problem for propositional calculus
with three literals per clause (3SAT) which is a known NP-complete problem [Garey
and Johnson, 1979]. We will reduce 3SAT to DS with three processors ($M=3$) and binary
decision sets. Given an instance of 3SAT, let $V$ be the set of literals and $C$ the set
of clauses. We construct an instance of DS as follows:


Let $Y_i = \{1,2,\ldots,|V|\}$, $U_i = \{0,1\}$, $i=1,2,3$. Let $S(k,k,k) = \{(0,0,0),(1,1,1)\}$,
$k=1,\ldots,|V|$. Finally, we interpret each clause in $C$ as stating that a certain triple
is not in the satisficing set. (For example, the clause $(x_i \vee \bar{x}_j \vee x_k)$, where
$1 \le i<j<k \le |V|$, translates to the statement that the triple of decisions $(0,1,0)$ does
not belong to $S(i,j,k)$.)

If there exists a satisfying assignment for the instance of 3SAT, let us define
decision rules $\gamma_i(j)=0$, $i=1,2,3$ (respectively $\gamma_i(j)=1$, $i=1,2,3$) if the satisfying
assignment sets the $j$-th literal to zero (respectively 1). It then follows that there
exist satisficing decision rules.

Conversely, if there exist satisficing decision rules $(\gamma_1,\gamma_2,\gamma_3)$, we must have
$\gamma_1(j) = \gamma_2(j) = \gamma_3(j)$, $\forall j$; we assign this common value of $\gamma_i(j)$ to the $j$-th literal
in $V$, to obtain an assignment that satisfies the clauses in $C$.  ■
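The reduction in this proof can be sketched on small formulas. The code below is an illustration under assumptions: clauses are encoded as lists of `(variable, sign)` pairs with distinct variables in increasing order (as in the example $1 \le i<j<k \le |V|$), observation triples not mentioned by any clause are taken to be unconstrained, and the names `ds_from_3sat` and `has_satisficing_rules` are hypothetical. The checker is a plain exhaustive search, not an efficient algorithm.

```python
from itertools import product

def ds_from_3sat(clauses):
    """Build the map S of the three-processor DS instance (sketch).

    clauses: list of clauses; each clause is a list of three (variable, sign)
    pairs, variables numbered from 1 and listed in increasing order, with
    sign True meaning that the variable appears un-negated.
    """
    all_triples = set(product((0, 1), repeat=3))
    forbidden = {}
    for clause in clauses:
        obs = tuple(v for v, _ in clause)
        # The single assignment falsifying the clause sets un-negated
        # variables to 0 and negated ones to 1; that triple is excluded.
        bad = tuple(0 if positive else 1 for _, positive in clause)
        forbidden.setdefault(obs, set()).add(bad)

    def S(y):
        if y[0] == y[1] == y[2]:
            return {(0, 0, 0), (1, 1, 1)}   # forces a common value per variable
        return all_triples - forbidden.get(y, set())
    return S

def has_satisficing_rules(num_vars, S):
    """Exhaustively test whether rules gamma_1, gamma_2, gamma_3 exist."""
    obs = range(1, num_vars + 1)
    for g in product(product((0, 1), repeat=num_vars), repeat=3):
        if all((g[0][a - 1], g[1][b - 1], g[2][c - 1]) in S((a, b, c))
               for a, b, c in product(obs, repeat=3)):
            return True
    return False

# (x1 or not x2 or x3) is satisfiable; all eight sign patterns together are not.
sat = [[(1, True), (2, False), (3, True)]]
unsat = [[(1, a), (2, b), (3, c)]
         for a in (True, False) for b in (True, False) for c in (True, False)]
```

On these two illustrative formulas the DS instance is feasible exactly when the formula is satisfiable, as the proof requires.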


Theorem  3.2.1  states  that  the  problem  DS  is,  in  general,  a  hard  combinatorial 
problem,  except  for  the  special  case  in  which  there  are  only  two  processors  and  each 
one  has  to  make  a  binary  decision.  It  should  be  noted  that  the  difficulty  is  not  caused 
by  an  attempt  to  optimize  with  respect  to  a  cost  function,  because  no  cost  function  has 
been introduced. In game theoretic language, we are faced with a "game of kind"
rather than a "game of degree."


We will now consider some special cases (which reflect the structure of typical
practical problems) and examine their computational complexity, trying to determine
the dividing line between easy and hard problems. From now on we restrict our
attention to the case in which there are only two processors. Clearly, if a problem
with two processors is hard, the corresponding problem with three or more processors
cannot be easier.

We have formulated above the problem DS so that all pairs $(y_1,y_2) \in Y_1 \times Y_2$ are
likely to occur. So, the information of different processors is completely unrelated;
their coupling is caused only by the structure of the satisficing sets $S(y_1,y_2)$. In
most practical situations, however, information is not completely unstructured: when
processor 1 observes $y_1$, it is often able to make certain inferences about the value of
the observation $y_2$ of the other processor and exclude certain values. We now formalize
these ideas:

Definition:  An Information Structure $I$ is a subset of $Y_1 \times Y_2$. We say that an
information structure $I$ has degree $(D_1,D_2)$ ($D_1$, $D_2$ are positive integers) if:

(i)  For each $y_1 \in Y_1$ there exist at most $D_1$ distinct elements $y_2$ of $Y_2$ such that
$(y_1,y_2) \in I$.

(ii)  For each $y_2 \in Y_2$ there exist at most $D_2$ distinct elements $y_1$ of $Y_1$ such that
$(y_1,y_2) \in I$.

(iii)  $D_1$, $D_2$ are the smallest integers satisfying (i), (ii).
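The degree of an information structure is straightforward to compute. A minimal sketch, assuming $I$ is given as a non-empty Python set of pairs (the function name `degree` is illustrative):

```python
from collections import Counter

def degree(I):
    """Return (D1, D2) for an information structure I, a non-empty set of pairs.

    D1 is the largest number of y2 values paired with a single y1;
    D2 is the largest number of y1 values paired with a single y2.
    """
    per_y1 = Counter(y1 for y1, _ in I)
    per_y2 = Counter(y2 for _, y2 in I)
    return max(per_y1.values()), max(per_y2.values())

# Processor 1 can always infer y2 from y1, but not conversely.
nested = {(1, "a"), (2, "a"), (3, "b")}
# Neither processor can always infer the other's observation.
non_nested = {(1, 1), (1, 2), (2, 2)}
```

Here `degree(nested)` is `(1, 2)` and `degree(non_nested)` is `(2, 2)`.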

We now interpret this definition: The information structure $I$ is the set of pairs
$(y_1,y_2)$ of observations that may occur together. If $I$ has degree $(D_1,D_2)$, processor 1
may use its own observation to decide which elements of $Y_2$ may have been observed by
processor 2. In particular, it may exclude all elements except for (at most) $D_1$ of
them. The situation faced by processor 2 is symmetrical.

If $D_1=1$ and processor 1 observes $y_1$, there is only one possible value for $y_2$.
So, processor 1 knows the observation of processor 2. (The converse is true when
$D_2=1$.) We could call this a nested information structure because the information of one
processor contains the information of the other.

When $D_1=D_2=1$, each processor knows the observation of the other; so, their
information is essentially shared.

Since pairs $(y_1,y_2)$ not in $I$ cannot occur, there is no meaning in requiring the
processors to make compatible decisions if $(y_1,y_2)$ were to be observed. This leads to
the following version of the problem DS:

DSI:  Given finite sets $Y_1, Y_2, U_1, U_2$, $I \subseteq Y_1 \times Y_2$ and a function $S: I \to 2^{U_1 \times U_2}$,
are there functions $\gamma_i: Y_i \to U_i$, $i=1,2$, such that

$$(\gamma_1(y_1),\gamma_2(y_2)) \in S(y_1,y_2), \quad \forall (y_1,y_2) \in I\,? \tag{3.2.3}$$

Note that any instance of DSI is equivalent to an instance of DS in which
$S(y_1,y_2) = U_1 \times U_2$, $\forall (y_1,y_2) \notin I$. That is, no compatibility restrictions are placed on the
decisions of the two processors, for those $(y_1,y_2)$ that cannot occur.

We now proceed to the main result of this Section:

Theorem  3.2.2: 

a)  The problem DSI restricted to instances satisfying any of the following:

(i)  One or more of $|U_1|$, $|U_2|$, $D_1$, $D_2$ is equal to 1,

(ii)  $|U_1| = |U_2| = 2$,

(iii)  $D_1 = D_2 = 2$,

(iv)  $D_1 = |U_2| = 2$ (or $D_2 = |U_1| = 2$),

may be solved in polynomial time.

b)  The problem DSI is NP-complete even if we restrict to instances for which
$|U_1| = D_1 = 3$, $|U_2| = D_2 = 2$.


Proof:  In order to study the complexity of the distributed satisficing problem, we
shall first point out the close connection between this problem and a family of
restricted versions of the satisfiability problem for propositional calculus, which we
call RSAT. A formula of RSAT has a set of variables $F = F_1 \cup F_2 \cup F_3$, where

$$F_1 = \{y_{ij}: i=1,\ldots,\ell;\ j=1,2,3\}, \quad F_2 = \{z_i: i=1,\ldots,m\}, \quad F_3 = \{x_i: i=1,\ldots,n\},$$

and $\ell, m, n$ are non-negative integers.

The set $C$ of clauses of an instance of RSAT consists of the following:

(i)  One clause for each $i$ between 1 and $\ell$, stating that exactly one of the
variables $y_{i1}, y_{i2}, y_{i3}$ is true,

(ii)  An arbitrary number of clauses of the form $(\bar{y}_{ij} \vee x_k)$ or $(\bar{y}_{ij} \vee \bar{x}_k)$, and

(iii)  An arbitrary number of clauses of the form $(z_i \vee x_j)$, $(\bar{z}_i \vee x_j)$, $(z_i \vee \bar{x}_j)$,
$(\bar{z}_i \vee \bar{x}_j)$.


For any fixed $i$, we will say that the variables $y_{i1}, y_{i2}, y_{i3}$ belong to the same
group. Each variable in $F_2$ or $F_3$ will be thought of as forming a separate group. A
clause of type (ii) or (iii) is said to connect two groups if the literals of that
clause belong to these groups. Finally, we will say that an instance of RSAT has
degree $(D_1,D_2)$ if each group of variables in $F_1$ or $F_2$ is connected to at most $D_1$
other groups and each group of variables in $F_3$ is connected to at most $D_2$ other
groups.

Lemma  3.2.1: 

a)  RSAT restricted to instances of degree $(D_1,D_2)$ is equivalent to DSI restricted
to instances of degree $(D_1,D_2)$, with $|U_1|=3$, $|U_2|=2$.

b)  RSAT restricted to instances for which $F_1=\emptyset$ is equivalent to DSI restricted to
instances for which $|U_1| = |U_2| = 2$.

Proof:  Let $Y_1 = \{1,\ldots,\ell+m\}$, $Y_2 = \{1,\ldots,n\}$. Think of $y_{ij}$ as stating that, if processor
1 observes $i$ (with $1 \le i \le \ell$) then it decides $j$; similarly, if processor 1 observes $\ell+i$
(with $1 \le i \le m$) it decides 1 (respectively 2) if $z_i=1$ (respectively 0). Finally, if
processor 2 observes $i$ ($1 \le i \le n$), it decides 1 (respectively 2) if $x_i=1$ (respectively 0).

We may then interpret the clauses of RSAT of type (ii) and (iii) as stating that
certain pairs of decisions are not in the satisficing set. (For example, the statement
$(3,2) \notin S(i,j)$, for some $i,j$, $1 \le i \le \ell$, $1 \le j \le n$, is equivalent to the clause $(\bar{y}_{i3} \vee x_j)$.)

Note that processor 1 may choose between three decisions and processor 2 between
two decisions, so that $|U_1|=3$, $|U_2|=2$. (In fact, if processor 1 observes $i$, with
$\ell+1 \le i \le \ell+m$, it may only decide 1 or 2. But this is the same as if it could also
decide 3 but the satisficing sets are such that

$$(3,k) \notin S(i,j), \quad \forall k \in \{1,2\},\ \forall i \in \{\ell+1,\ldots,\ell+m\},\ \forall j \in \{1,\ldots,n\}.)$$

In the case where $F_1=\emptyset$, processor 1 has only two choices for each observation,
and this corresponds to DSI with $|U_1| = |U_2| = 2$.

Clearly, each "group" of variables corresponds to a possible observation. If
two groups are not connected by a clause, then no constraint is placed on the
compatibility of decisions for the corresponding pair of observations, and conversely.
But placing no compatibility constraint is equivalent to assuming that this pair of
observations may never occur together. This shows that the degree of an instance of
RSAT is the same as the degree of the information structure $I$ for the corresponding
instance of DSI.

Clearly,  the  above  construction  may  proceed  both  ways  which  shows  the  equivalence 
of  the  two  problems.  This  concludes  the  proof  of  Lemma  3.2.1.  ■ 

a) (i)  If $|U_1|=1$ or $|U_2|=1$, the problem is trivial. If $D_2=1$, a satisficing decision
rule exists if and only if

$$\bigcap_{\{y_2:\ (y_1,y_2) \in I\}} \pi_1\big(S(y_1,y_2)\big) \neq \emptyset, \quad \forall y_1 \in Y_1,$$

where $\pi_1(u_1,u_2) = u_1$. The above condition can clearly be tested in polynomial time.
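The test for the nested case $D_2=1$ amounts to intersecting projections. A minimal sketch, assuming $S$ is stored as a dictionary keyed by the pairs in $I$ and that the instance really has $D_2=1$ (the function name `nested_feasible` is illustrative):

```python
def nested_feasible(Y1, U1, I, S):
    """Check the D_2 = 1 condition: for every y1, some decision u1 must be
    compatible with every observation y2 that can accompany y1.

    I: set of pairs (y1, y2); S: dict mapping each pair in I to a set of
    acceptable decision pairs (u1, u2).
    """
    for y1 in Y1:
        candidates = set(U1)
        for (a, b) in I:
            if a == y1:
                # pi_1(S(y1, y2)): first components of acceptable pairs.
                candidates &= {u1 for (u1, _) in S[(a, b)]}
        if not candidates:
            return False
    return True

I = {(1, "a"), (1, "b")}                       # each y2 occurs with one y1 only
S_bad = {(1, "a"): {(0, 0)}, (1, "b"): {(1, 0)}}
S_good = {(1, "a"): {(0, 0)}, (1, "b"): {(0, 1), (1, 0)}}
```

With `S_bad` processor 1 would need to decide 0 and 1 simultaneously, so the check fails; with `S_good` the decision 0 works for both possible observations of processor 2.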

a) (ii)  By Lemma 3.2.1(b), DSI with $|U_1| = |U_2| = 2$ is equivalent to RSAT restricted to
instances for which $F_1=\emptyset$; the latter is a special case of 2SAT (the satisfiability
problem of propositional calculus with two literals per clause) which may be solved
in time which is a linear function of the number of variables and clauses. So, an
$O(|Y_1| \cdot |Y_2|)$ solution is possible.
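The binary case can be sketched directly as a 2SAT computation. The code below is an illustration, not the report's algorithm: it encodes each excluded decision pair as one two-literal clause and solves the formula with the standard implication-graph method (strongly connected components via Kosaraju's algorithm). Decisions are assumed to be labelled 0 and 1, $S$ is assumed to be a dictionary keyed by the pairs in $I$, and all function names are hypothetical.

```python
def strongly_connected_components(graph):
    """Kosaraju's algorithm; component indices follow a topological order."""
    n = len(graph)
    seen, order = [False] * n, []
    for start in range(n):
        if seen[start]:
            continue
        seen[start] = True
        stack = [(start, iter(graph[start]))]
        while stack:
            node, successors = stack[-1]
            for nxt in successors:
                if not seen[nxt]:
                    seen[nxt] = True
                    stack.append((nxt, iter(graph[nxt])))
                    break
            else:                      # all successors explored
                order.append(node)
                stack.pop()
    reverse = [[] for _ in range(n)]
    for a in range(n):
        for b in graph[a]:
            reverse[b].append(a)
    comp, count = [-1] * n, 0
    for start in reversed(order):
        if comp[start] != -1:
            continue
        comp[start], stack = count, [start]
        while stack:
            node = stack.pop()
            for nxt in reverse[node]:
                if comp[nxt] == -1:
                    comp[nxt] = count
                    stack.append(nxt)
        count += 1
    return comp

def solve_binary_dsi(Y1, Y2, I, S):
    """Return rules (gamma1, gamma2) as dicts, or None if none exist."""
    var = {(1, y): k for k, y in enumerate(Y1)}
    var.update({(2, y): len(Y1) + k for k, y in enumerate(Y2)})
    n = len(var)
    # Literal 2*v means "variable v equals 1"; literal 2*v+1 means "equals 0".
    lit = lambda v, value: 2 * v + (0 if value == 1 else 1)
    graph = [[] for _ in range(2 * n)]
    for (y1, y2) in I:
        for a in (0, 1):
            for b in (0, 1):
                if (a, b) not in S[(y1, y2)]:
                    # Exclude (a, b): clause (u1 != a) or (u2 != b).
                    l1 = lit(var[(1, y1)], 1 - a)
                    l2 = lit(var[(2, y2)], 1 - b)
                    graph[l1 ^ 1].append(l2)
                    graph[l2 ^ 1].append(l1)
    comp = strongly_connected_components(graph)
    if any(comp[2 * v] == comp[2 * v + 1] for v in range(n)):
        return None
    value = lambda v: 1 if comp[2 * v] > comp[2 * v + 1] else 0
    return ({y: value(var[(1, y)]) for y in Y1},
            {y: value(var[(2, y)]) for y in Y2})
```

Each pair in $I$ contributes at most four clauses, so the formula has size proportional to $|I| \le |Y_1| \cdot |Y_2|$, in line with the bound stated above.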
a) (iii)  Let $D_1=D_2=2$. Possibly by renaming, assume that $Y_1$, $Y_2$ are disjoint sets.
Consider the graph $G=(Y_1 \cup Y_2, I)$. (Here $Y_1 \cup Y_2$ is the set of nodes, $I$ the set of
undirected edges.) Each node of this graph has degree at most 2. Therefore, the
connected components of $G$ are either isolated nodes, chains or cycles.* Each
connected component of $G$ defines a subproblem and these subproblems are decoupled. So,
without loss of generality, we may assume that $G$ consists of a single connected
component. We complete the proof under the assumption that $G$ is a cycle. (If $G$ is a
chain, the proof is similar but simpler; in fact we may introduce an additional edge
and make $G$ a cycle; this will not make the problem any easier, because a new edge
simply allows the presence of more constraints. If $G$ is an isolated node the problem
is trivial.)

Let us assume that $G$ is a cycle. Then $Y_1$ and $Y_2$ must have the same number $n$
of elements. In particular, the elements of $Y_1$, $Y_2$ may be numbered so that (see Fig.
3.2.1)

$$I = \{(i,i): i=1,\ldots,n\} \cup \{(i,i-1): i=2,\ldots,n\} \cup \{(1,n)\}.$$


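Let us define

$$S'(1,n-1) = \Big\{ (u_1,u'_{n-1}) \in U_1 \times U_2 : \exists (u_n,u'_n) \in U_1 \times U_2 \text{ such that } (u_n,u'_n) \in S(n,n),\ (u_n,u'_{n-1}) \in S(n,n-1),\ (u_1,u'_n) \in S(1,n) \Big\}$$

and note that $S'(1,n-1)$ may be evaluated in $O(|U_1|^2 |U_2|^2)$ time. We now have: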


*  A graph with $n$ nodes is a chain, if its nodes may be numbered so that
$E = \{(i,i+1): i=1,\ldots,n-1\}$; it is a cycle if its nodes may be numbered so that
$E = \{(i,i+1): i=1,\ldots,n-1\} \cup \{(n,1)\}$. (Here $E$ stands for the set of edges.)


An instance of DSI is a "YES" instance

$$\iff \exists (u_1,u'_1,\ldots,u_n,u'_n) \Big[ (u_i,u'_i) \in S(i,i),\ i=1,\ldots,n \ \text{ and } \ (u_i,u'_{i-1}) \in S(i,i-1),\ i=2,\ldots,n \ \text{ and } \ (u_1,u'_n) \in S(1,n) \Big]$$

$$\iff \exists (u_1,u'_1,\ldots,u_{n-1},u'_{n-1}) \Big[ (u_i,u'_i) \in S(i,i),\ i=1,\ldots,n-1 \ \text{ and } \ (u_i,u'_{i-1}) \in S(i,i-1),\ i=2,\ldots,n-1 \ \text{ and } \ (u_1,u'_{n-1}) \in S'(1,n-1) \Big].$$

This last expression corresponds to a new instance of DSI in which $n$ has been replaced
by $n-1$. Proceeding in the same way, the problem will be solved after at most $n$ similar
stages. This is essentially an algorithm of the dynamic programming type which solves
the problem in time $O(|Y_1| |U_1|^2 |U_2|^2)$.


Remark:  In fact an $O(|Y_1| |U_1| |U_2| (|U_1| + |U_2|))$ solution may be obtained, if at each
stage of the dynamic programming algorithm we only eliminate one, rather than two,
variables. If $G$ is a chain, an $O(|Y_1| |U_1| |U_2|)$ algorithm is obtained along the same
lines.
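The elimination step for the cycle case can be sketched as follows. This is an illustration under assumptions: the observations are already numbered so that $I = \{(i,i)\} \cup \{(i,i-1)\} \cup \{(1,n)\}$ with $n \ge 2$, $S$ is a dictionary keyed by those pairs, and the name `cycle_dsi_feasible` is hypothetical. The sketch eliminates two variables per stage, as in the proof, not one as in the Remark.

```python
from itertools import product

def cycle_dsi_feasible(n, U1, U2, S):
    """Decide a DSI instance whose information structure is a cycle (n >= 2).

    S maps (i, i), (i, i-1) and (1, n) to sets of acceptable pairs (u, u'),
    where u is a decision of processor 1 and u' a decision of processor 2.
    """
    wrap = S[(1, n)]          # current constraint linking u_1 and u'_k
    for k in range(n, 1, -1):
        # Eliminate (u_k, u'_k): this builds S'(1, k-1) from the text.
        wrap = {(u1, v_prev)
                for u1, v_prev in product(U1, U2)
                if any((uk, vk) in S[(k, k)]
                       and (uk, v_prev) in S[(k, k - 1)]
                       and (u1, vk) in wrap
                       for uk, vk in product(U1, U2))}
    # One pair (u_1, u'_1) is left; it must satisfy S(1,1) and S'(1,1).
    return any(pair in S[(1, 1)] for pair in wrap)

eq = {(0, 0), (1, 1)}      # the two decisions must coincide
neq = {(0, 1), (1, 0)}     # the two decisions must differ
consistent = {(1, 1): eq, (2, 2): eq, (2, 1): eq, (1, 2): eq}
parity_clash = {(1, 1): eq, (2, 2): eq, (2, 1): eq, (1, 2): neq}
```

Each stage examines every combination of four decisions, which is the $O(|U_1|^2 |U_2|^2)$ cost per stage quoted above. In the illustrative instance `parity_clash`, three "equal" constraints around the cycle contradict the single "different" constraint, so no satisficing rules exist.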

a) (iv)  We now suppose that $D_1 = |U_2| = 2$. Let $Y_1 = \{1,\ldots,m\}$, $Y_2 = \{1,\ldots,n\}$ and
assume, without loss of generality, that for each $k \in Y_1$ there exist exactly two distinct
elements $i$, $j$ of $Y_2$ such that $(k,i) \in I$, $(k,j) \in I$.


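Note that we have a "YES" instance of DSI if and only if

$$\exists u_1,\ldots,u_n \in U_2\ \ \exists v_1,\ldots,v_m \in U_1\ \ \forall (k,i) \in I\ \ \big[(v_k,u_i) \in S(k,i)\big]. \tag{3.2.4}$$

Consider also the statement

$$\exists u_1,\ldots,u_n \in U_2\ \ \forall (i,j) \in Y_2 \times Y_2\ \ \forall k \in Y_1\ \Big[ \big[(k,i) \in I \wedge (k,j) \in I \wedge i \neq j\big] \Rightarrow \exists v_k \in U_1 \big[(v_k,u_i) \in S(k,i) \wedge (v_k,u_j) \in S(k,j)\big] \Big] \tag{3.2.5}$$

which is equivalent to

$$\exists u_1,\ldots,u_n \in U_2\ \ \forall (i,j) \in Y_2 \times Y_2\ \ \big[(u_i,u_j) \in S'(i,j)\big], \tag{3.2.6}$$

where

$$S'(i,j) = \Big\{ (u,u') \in U_2 \times U_2 : \forall k \in Y_1\ \Big[ \big[(k,i) \in I \wedge (k,j) \in I \wedge i \neq j\big] \Rightarrow \exists v_k \in U_1 \big[(v_k,u) \in S(k,i) \wedge (v_k,u') \in S(k,j)\big] \Big] \Big\}. \tag{3.2.7}$$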

If (3.2.4) holds, then it is easy to see that (3.2.5) holds, as well, with
the same assignment of values to $u_i$, $v_k$. Conversely, assume that (3.2.5) holds.
For each $k$, there exists only one pair $(i,j)$ such that the condition
$[(k,i) \in I] \wedge [(k,j) \in I] \wedge [i \neq j]$ holds. Accordingly, for each $k$, the clause
$[(v_k,u_i) \in S(k,i) \wedge (v_k,u_j) \in S(k,j)]$ needs to be checked only for one pair $(i,j)$.
Therefore, for each $k$, a value of $v_k$ which makes (3.2.5) true can be chosen regardless
of $(i,j)$ and this value makes (3.2.4) true as well.

Therefore, we only need to show that the truth of (3.2.5) can be decided in
polynomial time. Note that, for each $(i,j)$, the set $S'(i,j)$ defined by (3.2.7) may
be constructed in time $O(|Y_1| |U_1|)$. Moreover there are at most $\min\{|Y_2|^2, |Y_1|\}$ pairs
to be considered; so the sets $S'(i,j)$ may be constructed in time
$O(|Y_1| |U_1| \min\{|Y_2|^2, |Y_1|\})$. Once $S'(i,j)$ is constructed, the statement $(u_i,u_j) \in S'(i,j)$
may be expressed as a set of clauses with two literals per clause (the literals are
the boolean variables $u_i$, $u_j$; this is similar to the proof of part (a)(ii)). Therefore,
deciding the truth of (3.2.5) is a special case of the satisfiability problem of
propositional calculus with two literals per clause (2SAT), which can be solved in linear
time. This concludes the proof of part (a).


Lemma  3.2.2: 


RSAT is NP-complete, even if $F_2=\emptyset$.

Proof:  The proof consists of reducing to RSAT (with $F_2=\emptyset$) the problem of
satisfiability of propositional formulae with three literals per clause (3SAT). Given such a
propositional formula, we shall construct an equivalent RSAT formula $F$, as follows:

For each variable of the original formula, we have in $F$ an x-variable. For each
pair of variables $a$ and $b$ of the original formula, we add to $F$ two triples of
y-variables $y_{abj}$ and $y'_{abj}$, $j=1,2,3$, and the corresponding "exactly one is true"
clause, for each triple. (For example, the clause "exactly one of $y_{ab1}, y_{ab2}, y_{ab3}$
is true," may be written as

$$(y_{ab1} \vee y_{ab2} \vee y_{ab3}) \wedge (\bar{y}_{ab1} \vee \bar{y}_{ab2}) \wedge (\bar{y}_{ab2} \vee \bar{y}_{ab3}) \wedge (\bar{y}_{ab3} \vee \bar{y}_{ab1}).)$$

Also, we add to $F$ the x-variable $z_{ab}$ and the following ten clauses:

$$(\bar{y}_{ab1} \vee a),\ (\bar{y}_{ab1} \vee b),\ (\bar{y}_{ab2} \vee a),\ (\bar{y}_{ab2} \vee \bar{b}),$$

$$(\bar{y}'_{ab1} \vee \bar{a}),\ (\bar{y}'_{ab1} \vee b),\ (\bar{y}'_{ab2} \vee \bar{a}),\ (\bar{y}'_{ab2} \vee \bar{b}),$$

$$(\bar{y}_{ab3} \vee z_{ab}),\ (\bar{y}'_{ab3} \vee \bar{z}_{ab}).$$

It is easy to verify that the above ten clauses force the variables $y_{ab1}, y_{ab2}, y'_{ab1}, y'_{ab2}$
to always take the same values as the expressions $(a \wedge b)$, $(a \wedge \bar{b})$, $(\bar{a} \wedge b)$, $(\bar{a} \wedge \bar{b})$,
respectively. Using this observation, we can rewrite any three-literal clause of our
original formula as a two-literal clause of $F$. (For example, the clause $(\bar{a} \vee \bar{b} \vee c)$ is
equivalent to $\overline{(a \wedge b)} \vee c = (\bar{y}_{ab1} \vee c)$, which is of the type of clause allowed
in RSAT.) This completes the proof of Lemma 3.2.2.  □
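The claim that the ten clauses pin down the y-variables can be checked by enumerating truth assignments. The sketch below assumes one reading of the clause list: every y-literal is negated, the two primed-triple variables appear as `w1, w2, w3`, and the auxiliary variable is `z`. These names, and the function `gadget_forces_values`, are illustrative and not from the report.

```python
from itertools import product

def exactly_one(p, q, r):
    return (p + q + r) == 1

def gadget_forces_values():
    """Check that, for every (a, b), the gadget is satisfiable and forces
    y1 = a&b, y2 = a&~b, w1 = ~a&b, w2 = ~a&~b (w stands for y')."""
    for a, b in product((False, True), repeat=2):
        satisfiable = False
        for y1, y2, y3, w1, w2, w3, z in product((False, True), repeat=7):
            if not (exactly_one(y1, y2, y3) and exactly_one(w1, w2, w3)):
                continue
            ten_clauses = [
                (not y1) or a,       (not y1) or b,
                (not y2) or a,       (not y2) or (not b),
                (not w1) or (not a), (not w1) or b,
                (not w2) or (not a), (not w2) or (not b),
                (not y3) or z,       (not w3) or (not z),
            ]
            if all(ten_clauses):
                satisfiable = True
                expected = (a and b, a and not b,
                            (not a) and b, (not a) and (not b))
                if (y1, y2, w1, w2) != expected:
                    return False
        if not satisfiable:
            return False
    return True
```

Under this reading the check succeeds: the two clauses involving `z` prevent `y3` and `w3` from being true together, so one of the four conjunction variables must be true, and the remaining eight clauses only allow the one that matches the values of $a$ and $b$.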


The proof of part (b) of Theorem 3.2.2 is completed by combining part (a) of
Lemma 3.2.1 with the following:

Lemma 3.2.3:  RSAT is NP-complete, even if it is restricted to instances for which
$D_1=3$, $D_2=2$.

Proof:  The idea of the proof is to create multiple copies of each variable so that,
instead of connecting a "group" of variables to many other groups, we connect each
time a different copy of the same group.

Suppose that we are given an instance $(F_1,F_2,F_3,C)$ of RSAT with $F_2=\emptyset$. We will
construct an equivalent instance $(F'_1,F'_2,F'_3,C')$ of RSAT, for which $D_1=3$, $D_2=2$. We let

$$F'_1 = \{y^p_{ij}:\ j=1,2,3;\ p=1,\ldots,|C|\} \cup F_1,$$

$$F'_2 = \{\hat{x}^p_i:\ 1 \le i \le n;\ 1 \le p \le |C|\},$$

$$F'_3 = \{\hat{y}^p_{ij}:\ 1 \le i \le \ell;\ j=1,2,3;\ 1 \le p \le |C|\} \cup \{x^p_i, \tilde{x}^p_i:\ 1 \le i \le n;\ 1 \le p \le |C|\} \cup F_3.$$

(Here $y^p_{ij}, \hat{y}^p_{ij}$ (respectively $x^p_i, \hat{x}^p_i, \tilde{x}^p_i$) are meant to be copies of $y_{ij}$ (respectively
$x_i$).)

The clauses in $C'$ are the following:

(i)  For each $i$, a clause stating that exactly one of $y_{i1}, y_{i2}, y_{i3}$ is true.

(ii)  For each $i$, $j$, clauses stating that the successive copies of $y_{ij}$ are all equal
to one another, and that the successive copies of $x_i$ are all equal to one another.
(Note that an equality $a=b$ may be written as $(a \vee \bar{b}) \wedge (\bar{a} \vee b)$.)

(iii)  Finally, for each clause of $C$ (say the $k$-th clause) introduce an equivalent
clause in $C'$ connecting the $k$-th copies of the variables involved. For example, if
the $k$-th clause is $(\bar{y}_{i1} \vee x_q)$ it would become $(\bar{y}^k_{i1} \vee x^k_q)$. It is easy to check
that this new instance of RSAT has degree (3,2) and is equivalent to the original
instance. This concludes the proof of the Theorem.  ■

The result concerning the case $D_1=1$ or $D_2=1$ is not surprising: it is well-known
that nested information structures may be exploited to solve otherwise difficult
decentralized problems. But except for the case $D_1=D_2=2$ (which is sort of a boundary)
the absence of nestedness makes decentralized problems computationally hard. Our
result gives a precise meaning to the statement that non-nested information structures
are much more difficult to handle than nested ones.

Theorem 3.2.2 shows that even if $D_1$, $D_2$ are held constant, the problem DSI is, in
general, NP-complete. There is, however, a special case of DSI, with $D_1$, $D_2$ constant,
for which an efficient algorithm of the dynamic programming type is possible.

Theorem 3.2.3:  Let $Y_1=Y_2=\{1,2,\ldots,n\}$. Let $D$ be a positive integer constant. Consider
those instances of DSI for which $(i,j) \in I$ implies either $|i-j| \le D$ or $|i-j| \ge n-D$. Then
DSI may be solved in time which is polynomial in $n$.

A condition of the type $|i-j| \le D$, $\forall (i,j) \in I$ is fairly natural in certain
applications. For example, suppose that the observations $y_1$ and $y_2$ are noisy measurements
of an unknown variable $x$ ($y_i = x + w_i$) where the noises $w_i$ are bounded: $|w_i| \le D/2$. Similarly,
the condition $|i-j| \le D$ or $|i-j| \ge n-D$, $\forall (i,j) \in I$, arises if the observations $y_1$, $y_2$ are
noisy measurements of an unknown quantized angle: $y_i = \theta + w_i \pmod{2\pi}$, where the noises
$w_i$ are again bounded by $D/2$.
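The band condition of Theorem 3.2.3 is easy to visualize by generating the largest information structure it allows. A small sketch, with the function name `banded_structure` chosen only for illustration:

```python
def banded_structure(n, D):
    """All pairs (i, j) in {1..n}^2 with |i - j| <= D or |i - j| >= n - D.

    This is the largest information structure permitted by the theorem:
    the two observations differ by at most D positions on a circle of n.
    """
    return {(i, j)
            for i in range(1, n + 1)
            for j in range(1, n + 1)
            if abs(i - j) <= D or abs(i - j) >= n - D}

I = banded_structure(8, 1)
```

For `n = 8` and `D = 1` each observation of one processor is compatible with exactly three observations of the other (itself and its two circular neighbours), so `I` has 24 pairs; for instance it contains `(1, 8)` but not `(1, 3)`.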


Proof:  This proof is effectively a generalization of the dynamic programming argument
in the proof of part (a)(iii) of Theorem 3.2.2. Let us assume that $n>2D$. For any $k$
such that $2D<k \le n$, let

$$\Gamma(k) = \Big\{ (u_1,v_1,\ldots,u_D,v_D,u_{k-D+1},v_{k-D+1},\ldots,u_k,v_k) \in (U_1 \times U_2)^{2D} : \exists (u_{D+1},v_{D+1},\ldots,u_{k-D},v_{k-D}) \in (U_1 \times U_2)^{k-2D} \text{ such that } (u_i,v_j) \in S(i,j),\ \forall (i,j) \in \{1,\ldots,k\}^2 \cap I \Big\}.$$

Note that $\Gamma(k)$ is of size at most $|U_1|^{2D} |U_2|^{2D}$. Now assume that $2D<k \le n-1$ and let

$$\hat{\Gamma}(k+1) = \Big\{ (u_1,v_1,\ldots,u_D,v_D,u_{k-D+1},v_{k-D+1},\ldots,u_{k+1},v_{k+1}) \in (U_1 \times U_2)^{2D+1} : \exists (u_{D+1},v_{D+1},\ldots,u_{k-D},v_{k-D}) \in (U_1 \times U_2)^{k-2D} \text{ such that } (u_i,v_j) \in S(i,j),\ \forall (i,j) \in \{1,\ldots,k+1\}^2 \cap I \Big\}.$$

Using the assumption $|i-j| \le D$ or $|i-j| \ge n-D$, $\forall (i,j) \in I$, we can see that

$$\{1,\ldots,k+1\}^2 \cap I \subseteq \big( \{1,\ldots,k\}^2 \cap I \big) \cup A_{k+1},$$

where

$$A_{k+1} = \{1,\ldots,D,k-D+1,\ldots,k+1\}^2 \cap I.$$

With this observation, $\hat{\Gamma}(k+1)$ may be rewritten as

$$\hat{\Gamma}(k+1) = \Big\{ (u_1,v_1,\ldots,u_D,v_D,u_{k-D+1},v_{k-D+1},\ldots,u_{k+1},v_{k+1}) \in (U_1 \times U_2)^{2D+1} : (u_1,v_1,\ldots,u_D,v_D,u_{k-D+1},v_{k-D+1},\ldots,u_k,v_k) \in \Gamma(k) \text{ and } (u_i,v_j) \in S(i,j),\ \forall (i,j) \in A_{k+1} \Big\}.$$

Assuming that the set $\Gamma(k)$ has been computed, we may use it to evaluate $\hat{\Gamma}(k+1)$
as follows: for each element of $\Gamma(k)$ (at most $|U_1|^{2D} |U_2|^{2D}$ elements) try each pair
$(u_{k+1},v_{k+1})$ ($|U_1| |U_2|$ pairs) and check for each $(i,j) \in A_{k+1}$ ($A_{k+1}$ has at
most $4D^2$ elements) whether $(u_i,v_j) \in S(i,j)$ holds. Therefore, given $\Gamma(k)$, we may obtain
$\hat{\Gamma}(k+1)$ in time $O(D^2 |U_1|^{2D+1} |U_2|^{2D+1})$. Finally, from $\hat{\Gamma}(k+1)$ we may easily obtain
$\Gamma(k+1)$, by taking a projection so as to eliminate $(u_{k-D+1}, v_{k-D+1})$. This process may be
repeated (for no more than $n$ stages) to compute $\Gamma(n)$, in time $O(n D^2 |U_1|^{2D+1} |U_2|^{2D+1})$.
Then note that we have a YES instance of DSI if and only if $\Gamma(n) \neq \emptyset$.  ■


Remark:  The  algorithm  in  the  proof  of  Theorem  3.2.3  does  not  find  a  satisficing 
decision  rule;  it  only  determines  whether  one  exists.  However,  satisficing  decision 
rules  may  be  computed  by  keeping  in  the  memory  some  of  the  intermediate  results  produced 
by  the  algorithm. 


3.3  DECENTRALIZED  DETECTION 

A  basic  problem  in  decentralized  signal  processing,  which  has  attracted  a  fair 
amount  of  attention  recently,  is  the  problem  of  decentralized  detection  (hypothesis 
testing)  [Tenney  and  Sandell,  1981;  Ekchian,  1982;  Ekchian  and  Tenney,  1982;  Kushner 
and  Pacut,  1982;  Lauer  and  Sandell,  1983].  In  this  section  we  consider  a  simple 
(discrete)  version  of  this  problem  involving  only  two  processors  and  two  hypotheses. 

Two processors $S_1$ and $S_2$ receive observations $y_1 \in Y_1$, $y_2 \in Y_2$, respectively, where
$Y_i$ is the set of all possible observations of processor $i$ (Figure 3.3.1). There
are two hypotheses $H_0$ and $H_1$ on the state of the environment, with prior probabilities
$p_0$ and $p_1$, respectively. For each hypothesis $H_i$, we are also given the joint
probability distribution $P(y_1,y_2|H_i)$ of the observations, conditioned on the event that
$H_i$ is true. Upon receipt of $y_i$, processor $S_i$ evaluates a binary message $u_i \in \{0,1\}$
according to a rule $u_i = \gamma_i(y_i)$, where $\gamma_i: Y_i \to \{0,1\}$. Then, $u_1$ and $u_2$ are transmitted
to a central processor (fusion center) which evaluates $u_0 = u_1 \wedge u_2$ and declares
hypothesis $H_0$ to be true if $u_0=0$, $H_1$ if $u_0=1$. (So, we essentially have a voting scheme.)
The problem is to select the functions $\gamma_1$, $\gamma_2$ so as to minimize the probability of
accepting the wrong hypothesis. (More general performance criteria may also be
considered.)

Most available results assume that

$$P(y_1,y_2|H_i) = P(y_1|H_i)\,P(y_2|H_i), \quad i=0,1, \qquad (3.3.1)$$

which states that the observations of the two processors are independent, when conditioned
on either hypothesis.* In particular, it has been shown [Tenney and Sandell, 1981] that if
(3.3.1) holds then the optimal decision rules $\gamma_i$ are given in terms of thresholds for
the likelihood ratio. The optimal thresholds for the two sensors are coupled through a
system of equations which gives necessary conditions of optimality. (These equations are
just the person-by-person optimality conditions.) Few analytical results are available when
the conditional independence assumption is removed [Lauer and Sandell, 1983]. The purpose of
this section is precisely to explain this state of affairs.

*Such an assumption is reasonable in problems of detection of a known signal in independent
noise, but is typically violated in problems of detection of an unknown signal.

Suppose that (3.3.1) holds and let $N_i$ denote the cardinality of $Y_i$. Given the results
of [Tenney and Sandell, 1981], there are only $N_i+1$ decision rules for processor $i$ which
are candidates for being optimal. We may evaluate the cost associated to each pair of
candidate decision rules and select a pair with least cost. This corresponds to a polynomial
algorithm and shows that under condition (3.3.1) decentralized detection is an easy problem.
Without the conditional independence assumption (3.3.1), however, there is no guarantee that
optimal decision rules can be defined in terms of thresholds for the likelihood ratio.
Accordingly, a solution by exhaustive enumeration could require the examination of as many
as $2^{N_1+N_2}$ pairs of decision rules. One might expect that a substantially faster (i.e.
polynomial) algorithm is possible. However, the main result of this section (Theorem 3.3.1
below) states that decentralized detection is NP-complete even if we restrict to instances
for which perfect detection (zero probability of error) is possible for the corresponding
centralized detection problem.
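The contrast between the two enumerations can be illustrated with a small sketch. It is not
taken from the report: the function names, the toy distributions and the convention that a
threshold rule sends 1 on the observations with the largest ratio $P(y|H_1)/P(y|H_0)$ are
assumptions made for illustration, and every observation is assumed to have positive
probability under $H_0$. The code evaluates the error probability under the voting rule
$u_0=u_1u_2$ and compares the best pair among the $(N_1+1)(N_2+1)$ threshold candidates with
the best pair among all $2^{N_1+N_2}$ pairs.

```python
from fractions import Fraction as F
from itertools import product

def error_probability(g1, g2, p0, p1, P0, P1):
    """Error probability when the fusion center declares H1 iff u1*u2 == 1.
    P0[i][y] = P(y | H0) and P1[i][y] = P(y | H1) for processor i; the two
    observations are assumed conditionally independent, as in (3.3.1)."""
    a = [sum(P0[i][y] for y in P0[i] if g[y]) for i, g in enumerate((g1, g2))]
    b = [sum(P1[i][y] for y in P1[i] if g[y]) for i, g in enumerate((g1, g2))]
    return p0 * a[0] * a[1] + p1 * (1 - b[0] * b[1])

def threshold_rules(P0i, P1i):
    """The N_i + 1 rules that send 1 on the k observations with the largest
    likelihood ratio, k = 0, ..., N_i (assumes P0i[y] > 0 for every y)."""
    order = sorted(P0i, key=lambda y: P1i[y] / P0i[y], reverse=True)
    return [{y: int(j < k) for j, y in enumerate(order)}
            for k in range(len(order) + 1)]

def all_rules(Yi):
    """All 2^{N_i} maps from the observation set to {0, 1}."""
    ys = list(Yi)
    return [dict(zip(ys, bits)) for bits in product((0, 1), repeat=len(ys))]

# Hypothetical toy instance (exact rational arithmetic).
p0 = p1 = F(1, 2)
P0 = [{"a": F(1, 2), "b": F(1, 3), "c": F(1, 6)}, {0: F(3, 4), 1: F(1, 4)}]
P1 = [{"a": F(1, 6), "b": F(1, 3), "c": F(1, 2)}, {0: F(1, 4), 1: F(3, 4)}]

thr = [threshold_rules(P0[i], P1[i]) for i in range(2)]
every = [all_rules(P0[i]) for i in range(2)]
cost = lambda pair: error_probability(pair[0], pair[1], p0, p1, P0, P1)
best_threshold = min(map(cost, product(*thr)))
best_overall = min(map(cost, product(*every)))
```

On this conditionally independent instance the two minima coincide, while the threshold
search examines 12 pairs instead of 32.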

We now present formally a suitable version of the problem:

Decentralized Detection (DD): We are given finite sets $Y_1$, $Y_2$; a rational number $K$; a
rational probability mass function $p: Y_1\times Y_2\to Q$; a partition $\{A_0,A_1\}$ of
$Y_1\times Y_2$. Do there exist $\gamma_i: Y_i\to\{0,1\}$, $i=1,2$, such that
$J(\gamma_1,\gamma_2)\le K$? Here

$$J(\gamma_1,\gamma_2) = \sum_{(y_1,y_2)\in A_0} p(y_1,y_2)\,\gamma_1(y_1)\gamma_2(y_2)
 + \sum_{(y_1,y_2)\in A_1} p(y_1,y_2)\,[1-\gamma_1(y_1)\gamma_2(y_2)]. \qquad (3.3.2)$$


Remarks:

1. In the above definition of DD, think of $H_i$ as being the hypothesis that
$(y_1,y_2)\in A_i$. Then it is easy to see that $J(\gamma_1,\gamma_2)$ corresponds to the
probability of error associated to the decision rules $\gamma_1$, $\gamma_2$. Note that if a
single processor knew both $y_1$ and $y_2$ (centralized information) it could make the
correct decision with certainty. Consequently, the above defined problem corresponds to the
special case of decentralized detection problems for which perfect centralized detection is
possible.

2. If we let $K=0$, then DD is a special case of problem DS with $|U_1|=|U_2|=2$
and is therefore polynomially solvable.
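As an illustration of the definition, the following sketch evaluates the cost (3.3.2) and
answers the DD question by exhaustive enumeration. The function names and the toy instance
are hypothetical, and the search is the exponential one discussed above, not an efficient
algorithm.

```python
from fractions import Fraction as F
from itertools import product

def dd_cost(g1, g2, p, A0):
    """J(g1, g2) of (3.3.2): a pair in A0 is an error when g1*g2 == 1,
    a pair outside A0 (that is, in A1) is an error when g1*g2 == 0."""
    total = 0
    for (y1, y2), prob in p.items():
        u0 = g1[y1] * g2[y2]
        total += prob * (u0 if (y1, y2) in A0 else 1 - u0)
    return total

def dd_decide(Y1, Y2, p, A0, K):
    """Answer the DD question by trying all 2^(N1+N2) pairs of rules."""
    for b1 in product((0, 1), repeat=len(Y1)):
        for b2 in product((0, 1), repeat=len(Y2)):
            if dd_cost(dict(zip(Y1, b1)), dict(zip(Y2, b2)), p, A0) <= K:
                return True
    return False

# Hypothetical instance: uniform p on {0,1} x {0,1}.
Y = [0, 1]
p = {(a, b): F(1, 4) for a in Y for b in Y}
A0_and = {(0, 0), (0, 1), (1, 0)}   # H1 holds only at (1, 1)
A0_eq = {(0, 1), (1, 0)}            # H1 holds exactly when y1 == y2
```

With `A0_and` the rules $\gamma_i(y_i)=y_i$ achieve $J=0$. With `A0_eq` no pair of rules
does better than $J=1/4$, although a centralized observer would make no errors.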


Theorem 3.3.1: DD is NP-complete.


Proof: Consider the following problem of propositional calculus, which we call P:

Problem P: We are given two sets $X=\{x_1,\ldots,x_m\}$, $Z=\{z_1,\ldots,z_n\}$ of boolean
variables; a set $D$ of (distinct) clauses of the form $x_i\wedge z_j$ or
$\overline{(x_i\wedge z_j)}$ (we assume that for any pair $(i,j)$ at most one of the above
clauses is in $D$); a collection $\{q_{ij}: i\in\{1,\ldots,m\}, j\in\{1,\ldots,n\}\}$ of
non-negative integers and an integer $K$. Is there a truth assignment for $X$ and $Z$ such
that $J\le K$, where

$$J = \sum_{\substack{(i,j)\in A_1 \\ x_i\wedge z_j=0}} q_{ij}
    + \sum_{\substack{(i,j)\in A_0 \\ x_i\wedge z_j=1}} q_{ij}, \qquad (3.3.3)$$

$A_0 = \{(i,j): \text{the clause } \overline{(x_i\wedge z_j)} \text{ is in } D\}$,

$A_1 = \{(i,j): \text{the clause } (x_i\wedge z_j) \text{ is in } D\}$.*


Lemma 3.3.1: Problem P is equivalent to DD.

Proof: Think of $X$, $Z$ as being the sets of observations of processors $S_1$, $S_2$,
respectively. A truth assignment to $X$, $Z$ corresponds to a choice as to what binary
message to transmit to the fusion center, given each processor's observation. Let $H_0$
(respectively $H_1$) be the hypothesis that $(i,j)\in A_0$ (respectively $A_1$). Finally, view
$q_{ij}$ as the (unnormalized) probability that the pair $(i,j)$ of observations is obtained
by the two processors. Pairs $(i,j)$ that belong to neither $A_0$ nor $A_1$ may be viewed as
having zero probability and are, therefore, of no concern. Then, it is easy to verify
that $J$ as defined by (3.3.3) is precisely the (unnormalized) probability of error. □

*So, $J$ is the sum of the weights $q_{ij}$ of the clauses that are not satisfied.
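The correspondence of Lemma 3.3.1 can be checked numerically on a small example. The sketch
below is illustrative only: the data layout (a dictionary from pairs $(i,j)$ to a
`(negated, weight)` tuple) and the function names are assumptions, not part of the report.
It computes the cost (3.3.3) of a truth assignment, builds the associated DD data, and
evaluates (3.3.2) on the same assignment.

```python
from fractions import Fraction as F

def p_cost(x, z, clauses):
    """Cost (3.3.3): total weight of the clauses that are not satisfied.
    clauses[(i, j)] = (negated, weight); negated=False stands for the clause
    x_i AND z_j, negated=True for its negation. x and z hold 0/1 values."""
    cost = 0
    for (i, j), (negated, weight) in clauses.items():
        value = x[i] * z[j]
        satisfied = (value == 0) if negated else (value == 1)
        if not satisfied:
            cost += weight
    return cost

def as_dd_instance(clauses):
    """Normalize the weights into a probability mass function; the negated
    clauses form A0, the remaining ones A1."""
    total = sum(w for _, w in clauses.values())
    p = {pair: F(w, total) for pair, (_, w) in clauses.items()}
    A0 = {pair for pair, (negated, _) in clauses.items() if negated}
    return p, A0, total

def dd_cost(g1, g2, p, A0):
    """Probability of error (3.3.2) under the voting rule u0 = u1*u2."""
    return sum(prob * (g1[a] * g2[b] if (a, b) in A0 else 1 - g1[a] * g2[b])
               for (a, b), prob in p.items())

# Hypothetical instance with three weighted clauses.
clauses = {(0, 0): (False, 3), (0, 1): (True, 2), (1, 1): (False, 1)}
x, z = [1, 0], [1, 1]
p, A0, total = as_dd_instance(clauses)
```

Here the unnormalized cost and the normalized probability of error differ only by the
factor `total`.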




In order to complete the proof of the Theorem, we need to show that P is
NP-complete. This will be accomplished by reducing to P the following (Maximum
2-Satisfiability) problem of propositional calculus which is known to be NP-complete
[Garey and Johnson, 1979].

MAX-2-SAT: Given a set $U$ of boolean variables, a collection $C$ of (distinct) clauses
over $U$, such that each clause $c\in C$ has exactly two variables, and an integer $K\le|C|$,
is there a truth assignment for $U$ which simultaneously satisfies at least $K$ of the
clauses in $C$? (Without loss of generality, we assume that if a clause is in $C$, then
its negation is not in $C$.)
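For concreteness, a brute-force evaluation of MAX-2-SAT is sketched below. The encoding is
an assumption made for illustration: clauses are taken to be disjunctions of two literals
(the usual form in Garey and Johnson), written as pairs of signed integers, and the function
name is hypothetical. The search is exponential in the number of variables.

```python
from itertools import product

def max2sat(num_vars, clauses):
    """Largest number of simultaneously satisfiable clauses. A clause is a pair
    of literals; literal +k means u_k and -k means NOT u_k (k = 1..num_vars)."""
    best = 0
    for bits in product((False, True), repeat=num_vars):
        def holds(literal):
            return bits[abs(literal) - 1] == (literal > 0)
        best = max(best, sum(holds(a) or holds(b) for a, b in clauses))
    return best

# All four disjunctions on two variables: every assignment falsifies exactly one.
four_clauses = [(1, 2), (-1, 2), (1, -2), (-1, -2)]
```

An instance $(U,C,K)$ is a YES instance exactly when `max2sat` returns at least $K$.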


Suppose that we are given an instance $(U,C,K)$ of MAX-2-SAT. We construct an
instance of P as follows: Suppose that $U=\{u_1,\ldots,u_n\}$. Then, let
$X=\{x_{i1},x_{i2},x_{i3}: i=1,\ldots,n\}$ and $Z=\{z_{i1},z_{i2},z_{i3}: i=1,\ldots,n\}$.
For each $i\in\{1,\ldots,n\}$ introduce the set $D_i$ of clauses:

UilA2i2) ' 

Ui2A2i2)  ' 

<Xi2Azi  1» ' 

lXi3A2i2) ’ 

Ui2AZi3! ' 

lXi3Azil> ' 

(xuAz.  3) , 

(Xi3A2i3)  ’ 

To  these  clauses  we  assign  the  weights  (L  is  a  large  integer  to  be  determined  later) : 


%l.i2-30L' 

qi2,i2'15L' 

qi2,U-“L' 

q ,  ,  =  2  011 , 

qi2,i3=St" 

qi3,il'2L' 

qil,i3-25L' 

q.,  .  =100L 
^i3,i3 

Next,  for  each  clause  (u^Au_J,  (u^Au^),  (uJV  u^),(u^vuj,  (u^vuj,  (u^vu.)  in 


C(withicj),  introduce  clauses  ,  (xi2^Zjl* '  *Xi2^Z j2*  '  ^Xi2^Zj2^  '  ^Xil^Zj2*' 


(x  Az  ) ,  respectively.  Denote  this  last  set  of  clauses  by  D  ,  and  assign  to  each 


one unit weight. We now let $D=\bigcup_{\ell=0}^{n}D_\ell$ and observe that $X$, $Z$, $D$,
$\{q_{ij}\}$, $K$ define an instance of P.

Note that for any assignment for $X$, $Z$, the corresponding cost (equation (3.3.3))
may be decomposed as

$$J = J_0 + \sum_{i=1}^{n} J_i, \qquad (3.3.4)$$

where $J_\ell$, $\ell\in\{0,1,\ldots,n\}$, is the sum of the weights $q_{ij}$ of the clauses in
$D_\ell$ which are not satisfied.


Lemma 3.3.2: For any $i\in\{1,\ldots,n\}$, we have $J_i=35L$ if and only if one of the
following is true:

(i) $x_{i1}=z_{i1}=x_{i3}=z_{i3}=1$, $x_{i2}=z_{i2}=0$.

(ii) $x_{i2}=z_{i2}=x_{i3}=z_{i3}=1$, $x_{i1}=z_{i1}=0$.

For any assignment to $\{x_{ik},z_{ik}: k=1,2,3\}$ other than the two assignments above,
$J_i\ge 37L$.


Proof: By direct evaluation of the costs of each possible assignment. (See
Table 3.3.1.) Note that in the calculations in Table 3.3.1, we only consider assignments
such that $x_{i3}=z_{i3}=1$. This is because the clause $(x_{i3}\wedge z_{i3})$ carries large
enough weight ($q_{i3,i3}=100L$), so that the possibility of letting
$(x_{i3}\wedge z_{i3})=0$ may be immediately discarded. □


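TABLE 3.3.1

Cost of each assignment to $(x_{i1},x_{i2},x_{i3},z_{i1},z_{i2},z_{i3})$. (The factor $L$ is
omitted.)

Columns: one per clause, on the variable pairs $(x_{i1},z_{i2})$, $(x_{i2},z_{i2})$,
$(x_{i2},z_{i1})$, $(x_{i3},z_{i2})$, $(x_{i2},z_{i3})$, $(x_{i3},z_{i1})$, $(x_{i1},z_{i3})$,
followed by TOTAL.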


In view of Lemma 3.3.2, the clauses in $D_i$ and their associated weights
have the following interpretation: the variable $x_{i1}$ may be freely assigned, but
the remaining variables must be assigned so that
$x_{i2}=z_{i2}=\bar{z}_{i1}=\bar{x}_{i1}$. For this reason the clauses in $D_0$ are effectively
the same as the original set $C$ of clauses.

Lemma 3.3.3: Let $L$ be large enough so that $|C|<L$. Then, there exists a truth
assignment for $U$ for which at least $K$ clauses in $C$ are satisfied, if and only if there
exists a truth assignment for $X$, $Z$ such that the resulting cost $J$ is less than or equal
to $35nL+|C|-K$.

Proof: (i) Given an assignment for $U$, with at least $K$ clauses satisfied, assign the
variables in $X$, $Z$ as follows:

$x_{i1}=z_{i1}=u_i$, $x_{i2}=z_{i2}=\bar{u}_i$, $x_{i3}=z_{i3}=1$.

Using Lemma 3.3.2 and the identity (3.3.4), the resulting cost is $35nL$ (i.e. $35L$ from
each collection $D_i$, $i=1,\ldots,n$) plus the number of clauses in $D_0$ which are not
satisfied (since these carry unit weight). The latter number is identical to the number of
clauses in $C$ which are not satisfied, which is less than or equal to $|C|-K$.

(ii) Conversely, given an assignment for $X$, $Z$ such that $J\le 35nL+|C|-K$, suppose that
for some $i\in\{1,\ldots,n\}$, $J_i\ge 37L$. Using Lemma 3.3.2 and the inequality $|C|<L$, we
obtain

$$J \ge \sum_{i=1}^{n} J_i \ge 35nL + 2L > 35nL + |C| - K,$$

which is a contradiction and shows that $J_i=35L$, $\forall i$. Consequently,
$\{x_{i1},x_{i2},z_{i1},z_{i2}\}$ have been assigned values in one of the two ways suggested by
Lemma 3.3.2. We now assign truth values for $U$, by setting $u_i=x_{i1}$. Then $J_0$ is the
number of clauses in $C$ which are not satisfied. Moreover, since $J_i=35L$,
$i\in\{1,\ldots,n\}$, it follows that $J_0\le|C|-K$, which implies that at least $K$ clauses
in $C$ are satisfied. This completes the proof of Lemma 3.3.3. □

It is easy to see that the above reduction of MAX-2-SAT to P is polynomial.
Therefore, P is NP-complete and so is DD, thus completing the proof of the Theorem. □

It should be pointed out that Theorem 3.3.1 remains valid if the problem is
modified so that the fusion center uses some other rule for combining the messages it
receives (e.g. $u_0=u_1(1-u_2)$), or if the combining rule is left free and the fusion
center is supposed to find and use an optimal such rule.
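To make the last variant concrete, the sketch below computes an optimal combining rule for
given local rules $\gamma_1$, $\gamma_2$. It is only an illustration under the DD setting
(hypothesis $H_0$ on $A_0$, $H_1$ on its complement); the function name and the toy instance
are hypothetical. For fixed local rules the best combining rule is found message pair by
message pair, so this step is easy; the sketch says nothing about how to choose the local
rules themselves.

```python
from fractions import Fraction as F

def best_fusion_rule(g1, g2, p, A0):
    """For fixed local rules, choose u0 for each message pair (u1, u2) so as to
    minimize the probability of error. Returns (rule, error)."""
    # mass[(u1, u2)] = [probability under H0, probability under H1]
    mass = {(u1, u2): [0, 0] for u1 in (0, 1) for u2 in (0, 1)}
    for (y1, y2), prob in p.items():
        hypothesis = 0 if (y1, y2) in A0 else 1
        mass[(g1[y1], g2[y2])][hypothesis] += prob
    # Declare H1 exactly where it carries more probability than H0.
    rule = {u: int(m[1] > m[0]) for u, m in mass.items()}
    error = sum(min(m) for m in mass.values())
    return rule, error

# Hypothetical instance: H1 holds exactly when y1 == y2, uniform p.
Y = [0, 1]
p = {(a, b): F(1, 4) for a in Y for b in Y}
A0 = {(0, 1), (1, 0)}
identity = {0: 0, 1: 1}
rule, error = best_fusion_rule(identity, identity, p, A0)
```

In this example each processor forwards its observation unchanged and the optimal combining
rule declares $H_1$ exactly when the two messages agree, with zero error.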

Let  us  now  interpret  Theorem  3.3.1.  Although  it  is,  in  a  sense,  a  negative 
result,  it  can  be  useful  in  suggesting  meaningful  directions  for  future  research: 
instead  of  looking  for  efficient  exact  algorithms,  the  focus  should  be  on  approximate 
ones.  (In  fact,  it  is  an  interesting  research  question  whether  polynomial  approximate 
algorithms  for  DD  exist).  Theorem  3.3.1  also  shows  that  any  necessary  conditions  to  be 
developed  for  problem  DD  will  be  deficient  in  one  of  two  respects: 

a)  Either  there  will  be  a  very  large  number  of  decision  rules  satisfying 
these  conditions, 

b)  Or,  it  will  be  hard  to  find  decision  rules  satisfying  these  conditions. 

Another consequence of Theorem 3.3.1 is that optimal decision rules are not given, in
general, in terms of thresholds for the likelihood ratio, because in that case an
efficient algorithm could be obtained (which Theorem 3.3.1 rules out unless P = NP). Of
course, this fact can also be verified directly by constructing appropriate examples. When
the condition (3.3.1) holds and decision rules are given in terms of thresholds, the
decision rule of a processor can be viewed as a tentative local decision, submitted to the
fusion center. In general, however, optimal decision rules are not threshold rules and this
interpretation is no longer valid. Rather, DD should be viewed as a problem of optimal
quantization of the observation of each processor. In that respect, Theorem 3.3.1
is reminiscent of the result of Garey, Johnson and Witsenhausen [1982], namely that
the general problem of minimum distortion quantization is NP-complete.


3.4  RELATED  PROBLEMS 

The best known static decentralized problem is the team decision problem
[Marschak and Radner, 1972] which admits an easy and elegant solution under linear
quadratic assumptions. Its discrete version is the following:

Team Decision Problem (TDP): Given finite sets $Y_1$, $Y_2$, $U_1$, $U_2$, a rational
probability mass function $p: Y_1\times Y_2\to Q$ and an integer cost function
$c: Y_1\times Y_2\times U_1\times U_2\to N$, find decision rules $\gamma_i: Y_i\to U_i$,
$i=1,2$, which minimize the expected cost

$$J(\gamma_1,\gamma_2) = \sum_{y_1\in Y_1}\sum_{y_2\in Y_2}
  c(y_1,y_2,\gamma_1(y_1),\gamma_2(y_2))\,p(y_1,y_2).$$


Another problem related to DS is the following:

Maximize Probability of Satisficing (MPS): Given finite sets $Y_1$, $Y_2$, $U_1$, $U_2$, a
probability mass function $p: Y_1\times Y_2\to Q$ and a function
$S: Y_1\times Y_2\to 2^{U_1\times U_2}$, find decision rules $\gamma_i: Y_i\to U_i$, $i=1,2$,
which maximize

$$J(\gamma_1,\gamma_2) = P[(\gamma_1(y_1),\gamma_2(y_2))\in S(y_1,y_2)]$$

(which is the probability of making a satisfactory decision).


Given an instance of TDP, let

$$S(y_1,y_2) = \{(u_1,u_2): c(y_1,y_2,u_1,u_2)=0\}.$$

If we solve TDP, we also effectively answer the question whether $J(\gamma_1,\gamma_2)=0$
for the optimal decision rules. But this is equivalent to solving the instance of DS
associated to the above definition of $S(y_1,y_2)$. Therefore, TDP cannot be easier than DS.
The same argument is also valid for MPS. It then follows from Theorem 3.2.1 that TDP and MPS
are NP-hard (that is, NP-complete or worse) even if we restrict to instances for which
$|U_1|=2$, $|U_2|=3$.

However, even more is true: it suffices to notice that the problem DD of the previous
section is a special case of both TDP and MPS, with $|U_1|=|U_2|=2$. Using Theorem 3.3.1,
we obtain:

Corollary 3.4.1: TDP and MPS are NP-hard even if we restrict to instances for which
$|U_1|=|U_2|=2$. This is true even if the cost function $c$ associated to TDP is restricted
to take only the values 0 and 1.
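The embedding of DD into TDP behind the Corollary can be written out explicitly. In the
sketch below (function names and the toy instance are hypothetical) the cost function takes
the value 1 exactly when the vote $u_1u_2$ names the wrong hypothesis, so that the expected
cost of TDP coincides with the probability of error (3.3.2). The solver is a plain
exhaustive search, included only to evaluate the example.

```python
from fractions import Fraction as F
from itertools import product

def tdp_cost(g1, g2, p, c):
    """Expected cost of the team decision problem for rules g1, g2."""
    return sum(prob * c(y1, y2, g1[y1], g2[y2]) for (y1, y2), prob in p.items())

def dd_as_tdp_cost(A0):
    """0/1 cost function: 1 exactly when the vote u1*u2 is wrong."""
    def c(y1, y2, u1, u2):
        return u1 * u2 if (y1, y2) in A0 else 1 - u1 * u2
    return c

def solve_tdp(Y1, Y2, U1, U2, p, c):
    """Exhaustive search over all pairs of decision rules (exponential time)."""
    best = None
    for b1 in product(U1, repeat=len(Y1)):
        for b2 in product(U2, repeat=len(Y2)):
            g1, g2 = dict(zip(Y1, b1)), dict(zip(Y2, b2))
            cost = tdp_cost(g1, g2, p, c)
            if best is None or cost < best[0]:
                best = (cost, g1, g2)
    return best

# Hypothetical DD instance: H1 holds exactly when y1 == y2, uniform p.
Y = [0, 1]
p = {(a, b): F(1, 4) for a in Y for b in Y}
c = dd_as_tdp_cost({(0, 1), (1, 0)})
optimal_cost, g1, g2 = solve_tdp(Y, Y, (0, 1), (0, 1), p, c)
```

The optimal expected cost of this TDP instance is the minimal probability of error of the
underlying DD instance, here $1/4$.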

We could also define dynamic versions of DS or of the team problem, in a straightforward
way [Tenney, 1983]. Since dynamic problems cannot be easier than static
ones, they are automatically NP-hard.

Corollary 3.4.1 states that unlike the linear quadratic case, the team decision
problem is, in general, a hard combinatorial problem. The reason for this difference
is the following: in the linear quadratic problem, Radner's theorem [Radner, 1962]
guarantees that, as a consequence of the convex structure of the problem, a
person-by-person optimal decision rule is also team optimal. This is no longer true for
nonconvex (for example, discrete) team problems. While it may be argued that finding
person-by-person optimal decision rules is relatively easy, there is no simple criterion
for deciding whether a given person-by-person optimal decision rule is also team
optimal and this is really the source of the difficulty. Let us also stress here
that difficulties arising from the possibility of multiple person-by-person optima
are equally relevant to team decision problems formulated on continuous decision and
probability spaces, as is commonly done. In other words, the difficulties do not arise
because we discretize an otherwise easy problem, but they are of a more fundamental
nature [Papadimitriou and Tsitsiklis, 1984].
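A minimal example of a person-by-person optimum that is not team optimal is sketched below.
It is a deliberately degenerate, hypothetical instance (each decision maker has a single
possible observation, so a decision rule is just a decision), and the cost values are chosen
only for illustration.

```python
def is_person_by_person_optimal(u1, u2, cost, U1, U2):
    """True if neither decision maker can lower the cost by changing only
    its own decision while the other decision stays fixed."""
    return (all(cost[(u1, u2)] <= cost[(v, u2)] for v in U1) and
            all(cost[(u1, u2)] <= cost[(u1, v)] for v in U2))

# Hypothetical nonconvex (discrete) team cost.
U = (0, 1)
cost = {(0, 0): 1, (0, 1): 2, (1, 0): 2, (1, 1): 0}
team_optimum = min(cost, key=cost.get)
```

The pair $(0,0)$ passes the person-by-person test, since any unilateral change raises the
cost from 1 to 2, yet the team optimum is $(1,1)$ with cost 0.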

3.5  ON  DESIGNING  COMMUNICATIONS  PROTOCOLS 

Suppose  that  we  are  given  an  instance  of  the  distributed  satisficing  problem  (DS) 
and  that  it  was  concluded  that  unless  the  processors  communicate,  satisficing  cannot 
be  guaranteed  for  all  possible  observations.  Assuming  that  communications  are  allowed 
(but  are  costly) ,  we  have  to  consider  the  problem  of  designing  a  communications 
protocol: what should each processor communicate to the other, and in what order?
Moreover,  since  communications  are  costly,  we  are  interested  in  a  protocol  which 
minimizes  the  total  number  of  binary  messages  (bits)  that  have  to  be  communicated  in 
the  worst  case. 

Before proceeding, we must make more precise the notion of a communication
protocol and of the number of bits that guarantee satisficing.

Given an instance $\mathcal{P}=(Y_1,Y_2,U_1,U_2,I,S)$ of the problem DSI we will say that:

There is a protocol which guarantees satisficing with 0 bits of communications,
if $\mathcal{P}$ is a YES instance of the problem DSI. (That is, if there exist satisficing
decision rules, involving no communications.)

We then proceed inductively:

There is a protocol which guarantees satisficing with $K$ bits of communications
($K\in N$), if for some $i\in\{1,2\}$ (say, $i=1$) there is a function $m: Y_1\to\{0,1\}$,
such that for each of the instances
$\mathcal{P}'=(Y_1\cap m^{-1}(0),Y_2,U_1,U_2,I\cap[(Y_1\cap m^{-1}(0))\times Y_2],S)$ and
$\mathcal{P}''=(Y_1\cap m^{-1}(1),Y_2,U_1,U_2,I\cap[(Y_1\cap m^{-1}(1))\times Y_2],S)$ there is
a protocol which guarantees satisficing with not more than $K-1$ bits of communications.
(Here $m^{-1}(i)=\{y_1\in Y_1: m(y_1)=i\}$.)

The envisaged sequence of events behind this definition is the following: Each
processor observes its measurement $y_i\in Y_i$, $i=1,2$. Then, one of the processors, say
processor 1, transmits a message $i=m(y_1)$, with a single bit, to the other processor.
From that point on, it has become common knowledge that $y_1\in Y_1\cap m^{-1}(i)$;
therefore, the remaining elements of $Y_1$ may be ignored.
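The inductive definition translates directly into a recursive test. The sketch below is a
brute-force illustration, exponential in the sizes of the sets and not an efficient
algorithm; the function names, the representation of $I$ as a set of pairs and of $S$ as a
dictionary of sets, and the two toy instances are assumptions. It also assumes
$S(y_1,y_2)\neq\emptyset$ on $I$, so that `min_bits` terminates.

```python
from itertools import product

def dsi_yes(Y1, Y2, U1, U2, I, S):
    """Brute-force DSI: do rules g1, g2 exist with (g1(y1), g2(y2)) in S(y1, y2)
    for every pair (y1, y2) in I?"""
    Y1, Y2 = sorted(Y1), sorted(Y2)
    for b1 in product(U1, repeat=len(Y1)):
        g1 = dict(zip(Y1, b1))
        for b2 in product(U2, repeat=len(Y2)):
            g2 = dict(zip(Y2, b2))
            if all((g1[a], g2[b]) in S[(a, b)] for (a, b) in I):
                return True
    return False

def bits_suffice(Y1, Y2, U1, U2, I, S, K):
    """True if some protocol guarantees satisficing with at most K bits."""
    I = {(a, b) for (a, b) in I if a in Y1 and b in Y2}
    if dsi_yes(Y1, Y2, U1, U2, I, S):
        return True
    if K == 0:
        return False
    for sender, Y in ((1, sorted(Y1)), (2, sorted(Y2))):
        for bits in product((0, 1), repeat=len(Y)):
            parts = [{y for y, b in zip(Y, bits) if b == v} for v in (0, 1)]
            if not parts[0] or not parts[1]:
                continue  # a constant message carries no information
            if all(bits_suffice(part if sender == 1 else Y1,
                                part if sender == 2 else Y2,
                                U1, U2, I, S, K - 1) for part in parts):
                return True
    return False

def min_bits(Y1, Y2, U1, U2, I, S):
    """Smallest K for which a K-bit protocol guarantees satisficing."""
    K = 0
    while not bits_suffice(Y1, Y2, U1, U2, I, S, K):
        K += 1
    return K

# Two hypothetical instances: both processors must output f(y1, y2).
Y, U = {0, 1}, (0, 1)
pairs = {(a, b) for a in Y for b in Y}
xor = {(a, b): {(a ^ b, a ^ b)} for (a, b) in pairs}   # f = y1 XOR y2
first = {(a, b): {(a, a)} for (a, b) in pairs}         # f = y1
```

For $f=y_1$ one bit from processor 1 suffices, while for the exclusive-or each processor
must send its bit, for a total of two.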

We  can  now  state  formally  the  problem  of  interest: 

MBS (Minimum bits to satisfice): Given an instance $\mathcal{P}$ of DSI and $K\in N$, is
there a protocol which guarantees satisficing with not more than $K$ bits of communications?

By definition, MBS with $K=0$ is identical to the problem DSI. Moreover, MBS with
$K$ arbitrary cannot be easier than MBS with $K=0$ (which is a special case). Therefore,
MBS is, in general, NP-hard. Differently said, problems involving communications are
at least as hard as problems involving no communications.

We have seen in Section 3.2 that when $|U_1|=|U_2|=2$, DSI may be solved in polynomial
time. Therefore, MBS with $K=0$, $|U_1|=2$, $|U_2|=2$ is polynomially solvable.
However, for arbitrary $K$, this is no longer true:

Theorem 3.5.1: MBS is NP-complete, even if $U_1=U_2=\{0,1\}$ and even if we restrict
to instances for which, for any $(y_1,y_2)\in I$, either $S(y_1,y_2)=\{(0,0)\}$ or
$S(y_1,y_2)=\{(1,1)\}$.


The above theorem proves a conjecture of A. Yao [Yao, 1979]. The proof was
mainly constructed by C. Papadimitriou and may be found in [Papadimitriou and
Tsitsiklis, 1982].

We should point out that the special case referred to in Theorem 3.5.1 concerns
the problem of distributed function evaluation: we are given a Boolean function
$f: Y_1\times Y_2\to\{0,1\}$ and we require that both processors eventually determine the
value of the function (given the observation-input $(y_1,y_2)$), by exchanging a minimum
number of bits. In our formalism, $S(y_1,y_2)=\{(0,0)\}$ if $f(y_1,y_2)=0$ and
$S(y_1,y_2)=\{(1,1)\}$ if $f(y_1,y_2)=1$.

In Section 3.2 we had investigated the complexity of DSI by restricting to instances
for which the set $I$ had constant degree $(D_1,D_2)$. This may be done, in principle,
for MBS, as well, but no results are available, except for the simple case in which
$D_1=D_2=2$. In fact, when $D_1=D_2=2$ each processor may transmit its information to the
other processor by communicating a single binary message and, for this reason, we have:

Proposition 3.5.1: MBS restricted to instances for which $D_1=D_2=2$ may be solved
in polynomial time. Moreover, an optimal protocol requires transmission of at most
two binary messages, one from each processor.

When $(D_1,D_2)$ is larger than $(2,2)$, there is not much we can say about optimal
protocols. However, it is easy to verify that there exist fairly simple non-optimal
protocols (which may be calculated in polynomial time) which involve relatively small
amounts of communication. This is because:

Proposition 3.5.2: Suppose that $I$ has degree $(D_1,D_2)$ and that
$S(y_1,y_2)\neq\emptyset$, $\forall(y_1,y_2)\in I$. Then information may be centralized (and
therefore satisficing is guaranteed) by means of a protocol requiring communication of at
most $[\log_2(D_1D_2)]$ binary messages by each processor. Moreover, such a protocol may be
constructed in time $O((|Y_1|\cdot|Y_2|)(|Y_1|+|Y_2|))$. (Here $[x]$, $x\in R$, stands for the
smallest integer larger than $x$.)

Remark: It might be tempting to guess that processor 1 (respectively 2) needs to
communicate only $[\log_2 D_1]$ (respectively $[\log_2 D_2]$) bits, but this is not true, as
can be seen from fairly simple examples.*

Proof: Consider the (undirected) graph $G=(Y_1\cup Y_2, I)$. Here $Y_1\cup Y_2$ is the set of
nodes (we assume that, by possibly renaming elements, $Y_1\cap Y_2=\emptyset$) and $I$ is the
set of edges. Note that $G$ is a bipartite graph. Each $y\in Y_i$ is connected to at most
$D_i$ other nodes.

We will show that we may colour the nodes of $G$ so that, if $(y_1,y_2)\in I$ and
$(y_1,y_2')\in I$, then $y_2$ and $y_2'$ have different colours; similarly, if
$(y_1,y_2)\in I$, $(y_1',y_2)\in I$, then $y_1$, $y_1'$ will have different colours. Moreover,
at most $D_1D_2$ colours will be used. Then, each processor $i$ may transmit to the other the
colour of his own observation ($[\log_2(D_1D_2)]$ binary messages) and the other processor
will be able to infer the exact value of the observation of the first one.

The above mentioned colouring may be accomplished as follows: We first colour
the elements of $Y_1$. Suppose that the first $k$ elements $y_1^{(1)},\ldots,y_1^{(k)}$ of
$Y_1$ have been coloured. Consider the $(k+1)$-st element $y_1^{(k+1)}$. If for some $i\le k$,
there exists $y_2\in Y_2$ such that $(y_1^{(k+1)},y_2)\in I$ and $(y_1^{(i)},y_2)\in I$, then
$y_1^{(k+1)}$ must be coloured differently from $y_1^{(i)}$. Now, $y_1^{(k+1)}$ is connected
to at most $D_1$ elements of $Y_2$ and each such element of $Y_2$ is connected to at most
$D_2-1$ elements of $Y_1$, other than $y_1^{(k+1)}$. This means that at most
$D_1(D_2-1)\le D_1D_2-1$ colours are prohibited for $y_1^{(k+1)}$. Therefore, if $D_1D_2$
colours are available, $y_1^{(k+1)}$ may be coloured in the desired way, and so on. Nodes in
$Y_2$ may also be coloured in the same way.

*Nevertheless, there exist protocols for centralizing information which are more efficient
than the one we construct here (A. El Gamal, private communication). See also
[Witsenhausen, 1976] for a related problem.

With this algorithm, for each $y_1^{(k+1)}$ to be coloured we must examine $|Y_1|\cdot|Y_2|$
elements $(y_1,y_2)$ to check whether $(y_1^{(k+1)},y_2)\in I$ and $(y_1,y_2)\in I$. So, the
set $Y_1$ is coloured in time $|Y_1|^2|Y_2|$ and the construction of the desired protocol
takes time $O(|Y_1||Y_2|(|Y_1|+|Y_2|))$. □
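The greedy colouring used in the proof can be sketched as follows. The function names, the
neighbour-set representation of $I$ and the cyclic toy instance are assumptions made for
illustration; the sketch follows the structure of the proof rather than its running-time
accounting.

```python
def colour_side(Y, neighbours, num_colours):
    """Greedily colour the elements of Y so that two elements sharing a
    neighbour on the other side receive different colours.
    neighbours[y] is the set of nodes on the other side adjacent to y."""
    colour = {}
    for y in Y:
        forbidden = {colour[w] for w in colour if neighbours[w] & neighbours[y]}
        colour[y] = next(c for c in range(num_colours) if c not in forbidden)
    return colour

def decode(colour_received, own_neighbours, colour):
    """Observations of the other processor that are compatible with the
    receiver's own observation and carry the received colour."""
    return [y for y in own_neighbours if colour[y] == colour_received]

# Hypothetical instance: y1 is compatible with y2 = y1 and y2 = y1 + 1 (mod 4).
Y1 = Y2 = range(4)
I = {(a, b) for a in Y1 for b in (a, (a + 1) % 4)}
n1 = {a: {b for (x, b) in I if x == a} for a in Y1}
n2 = {b: {a for (a, y) in I if y == b} for b in Y2}
D1 = max(len(s) for s in n1.values())
D2 = max(len(s) for s in n2.values())
c1 = colour_side(Y1, n1, D1 * D2)
c2 = colour_side(Y2, n2, D1 * D2)
bits_per_processor = (D1 * D2 - 1).bit_length()  # bits needed to name one of D1*D2 colours
```

In this instance $D_1=D_2=2$, so four colours (two bits per processor) are available, and
each processor recovers the other's observation uniquely from the received colour.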


3.6  CONCLUSIONS 

We  summarize  here  the  main  conclusions  from  the  investigations  of  this  Chapter. 

Even if a set of processors have complete knowledge of the structure of a decentralized
decision making problem and the desired goal; even if the corresponding centralized
problem  is  trivial;  even  if  all  relevant  sets  are  finite,  a  satisficing  decision  rule 
that  involves  no  on-line  communications  may  be  very  hard  to  find,  the  corresponding 
problem  being,  in  general,  NP-complete.  There  are  many  objections  to  the  idea  that 
NP-completeness  is  an  unequivocal  measure  of  the  difficulty  of  a  problem,  because  it 


is  based  on  a  worst  case  analysis,  whereas  the  average  performance  of  an  algorithm 
might  be  a  more  adequate  measure;  moreover  NP-hard  optimization  problems  may  have  very 
simple  approximate  algorithms.  However,  NP-complete  problems  are  often  characterized 
by  the  property  that  any  known  algorithm  is  very  close  to  systematic  exhaustive  search; 
they  do  not  possess  any  structure  to  be  exploited.  Furthermore,  NP-completeness  of  a 
discrete  problem  is  an  indication  that  the  corresponding  continuous  problem  is  very 
likely  to  be  hard  as  well. 

Concerning the problem DS, and its variations, we may reach the following
specific conclusions: No simple algorithm could solve DS. Given that communications
would certainly be required for those instances of DS that possess no satisficing
decision rules, it would not be a great loss if we allowed the processors to communicate
even for some instances of DS for which this would not be necessary. Even if these
extra communications, being redundant, do not lead to better decisions, they may greatly
facilitate the decision process and, from a practical point of view, remove some load
from the computing machines employed.

Concerning the problem of distributed detection, we have shown that it becomes
hard, once a simplifying assumption of conditional independence is removed. This
explains why no substantial progress on this problem had followed the work of Tenney and
Sandell [1981].

From a more general perspective, we are in a position to say that the basic
(and the simplest) problems of decentralized decision making are hard, in a precise
mathematical sense. Moreover, their difficulty does not only arise when one is
interested in optimality. Difficulties persist even if optimality is replaced by
satisficing. As a consequence, further research should focus on special cases and
easily solvable problems as well as on approximate versions of the original problems.


In cases where communications are necessary (but costly) there arises naturally
the problem of designing a protocol of communications. Unfortunately, if this problem
is approached with the intention to minimize the amount of communications that will
guarantee the accomplishment of a given goal, we are again led to intractable
combinatorial problems. Therefore, practical communications protocols can only be designed
on a "good" heuristic or ad-hoc basis, and they should not be expected to be optimal;
approximate optimality is probably a more meaningful goal. Again, allowing some
redundancy in on-line communications may lead to substantial savings in off-line
computations.


CHAPTER  4:  CONVERGENCE  AND  ASYMPTOTIC  AGREEMENT  OF  PROCESSORS 
INTERESTED  IN  A  COMMON  DECISION 

4.1  INTRODUCTION  AND  MOTIVATION 

There  are  many  situations  in  which  several  processors  (decision  makers)  with 
different  on-line  information  (input)  have  to  cooperate,  combine  their  information 
and  arrive  at  a  common  decision.  The  distributed  function  evaluation  problem  of 
Section  3.5  is  an  example.  There  are  two  issues  involved  here:  first,  the  processors 
must  reach  consensus  and,  second,  their  final  decision  must  be  a  desirable  one,  in 
some  sense.  Concerning  the  issues  of  what  is  a  desirable  decision,  we  abandon  the 
"satisficing"  point  of  view  of  Chapter  3  and  we  introduce  a  cost  function.  We  will 
take,  however,  a  more  pragmatic  approach  than  in  Chapter  3:  we  will  not  insist  that 
final  decisions  be  as  desirable  as  possible  (that  is,  optimal).  We  concentrate  on 
the  requirement  that  consensus  must  be  reached  and,  we  require,  as  a  secondary  issue, 
that  the  final  decision  takes  into  account  the  cost  function  involved  and  that  it  is 
better  than  the  decision  that  each  processor  would  make  if  it  were  to  rely  exclusively 
on  its  own  information. 

To motivate our approach, we are interested in situations in which

(i)  There  are  rigid  time  limits  within  which  preliminary  or  final  decisions 
must  be  made. 

(ii)  There  are  communications  limitations,  restricting  the  number  and  the  nature 
of  the  messages  that  can  be  exchanged. 

(iii)  Conflict  resolution  procedures  involving  higher  levels  of  the  decision  making 
hierarchy,  are  undesirable  because  they  are  likely  to  result  in  delays  and 
tend  to  overload  these  higher  levels . 

We will be looking for a scheme which leads to consensus, while taking into
consideration, directly or indirectly, the above requirements.

First and foremost, any scheme that would involve the centralization of all
data (by communicating them to a predetermined processor) should be considered
undesirable for the following reasons: it is often the case that too many data are
available, which would overload communications channels; moreover good decisions can
often be made on the basis of aggregations of the initial data; furthermore, if a
processor is a model of a human, the plethora of data would saturate his short-term
memory. This implies that it is preferable to communicate a few aggregate data.
Determining an optimal way to "aggregate" is not a well-posed problem. If constraints
are placed on the number of bits to be transmitted, the problem becomes computationally
intractable, even if there is only a finite number of possible events and decisions
(Section 3.5). On the other hand, if a message is allowed to be any real number,
all data can be coded in a single message. Also, for any fixed aggregation protocol,
a processor could slightly change its message and code all information in the least
significant bits of the message. (This is reminiscent of decentralized control
problems in which a processor may observe the decisions of other processors, the
so-called control sharing pattern [Aoki, 1973; Sandell and Athans, 1974].) Any such
trick is very sensitive to noise in the channel and is effectively just a more
complicated way of centralizing information. Since centralization was deemed undesirable
in the first place, any indirect way of centralizing should also be undesirable and
explicitly prohibited.

The above discussion, as well as the conclusions in Section 3.6, imply that a
particular aggregation of the data should be chosen by means of some ad-hoc rule
that guarantees that certain desirable characteristics are present. Unless some
particular structure on the problem is assumed, the optimal decision (given a
processor's information) seems to be a very natural message that a processor
could transmit and this is precisely the protocol that we study. Of course, a
demonstration that our scheme leads to (approximate) consensus with fewer communications
and/or computations than direct centralization has to rely on numerical
experimentation and the answer will depend on the specific situation.

Problem  Description  and  Overview 

We consider the following situation: A set {1,...,N} of N processors possessing
a common model of the world (same prior probabilities) and having the same cost
function (common objective) want to make an optimal decision. Each processor bases its
decision on a set of observations it has obtained and we allow these observations to
be different for each processor. Given this setting, the decisions of the processors
will be generally different. Aumann [1976] has shown, however, that agreement is
guaranteed in the following particular case: If the decision to be made is the
evaluation of the posterior probability of some event and if all processors' posteriors
are common knowledge, then all processors agree. (In Aumann's terminology, common
knowledge of an event means that all processors know it, all processors know that
all processors know it, and so on, ad infinitum.)

The situation where each processor's posterior is common knowledge is very
unlikely, in general. On the other hand, if agreement is to be guaranteed, posteriors
have to be common knowledge. The problem then becomes how to reach a state of
agreement where decisions are common knowledge, starting from an initial state of
disagreement.


Geanakoplos and Polemarchakis [1978] and Borkar and Varaiya [1982] gave the
following natural solution to the above problem: processors start communicating
to each other their tentative posteriors (or, in the formulation of [Borkar
and Varaiya, 1982], the conditional expectation of a fixed random variable) and then
update their own posterior, taking into account the new information they have
received. In the limit, each processor's posterior converges (by the martingale
convergence theorem) and, assuming that "enough" communications have taken place,
they all have to converge to a common limit.
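
The exchange of tentative posteriors can be illustrated on a finite state space. The following is a minimal sketch, assuming two processors with perfect memory, a common prior, and information described by partitions of the state space; the function names and the nine-state example are illustrative assumptions and are not taken from the cited papers. An announced posterior reveals the set of states at which that value would have been announced, so the listener refines its partition accordingly.

```python
from fractions import Fraction

def posterior(cell, event, prior):
    """P(event | cell) for a cell of positive prior mass."""
    mass = sum(prior[w] for w in cell)
    return sum(prior[w] for w in cell if w in event) / mass

def announce(partition, event, prior):
    """Posterior reported at each state by the owner of `partition`."""
    return {w: posterior(cell, event, prior) for cell in partition for w in cell}

def refine(partition, labels):
    """Split each cell according to the announced value labels[w]."""
    refined = []
    for cell in partition:
        groups = {}
        for w in cell:
            groups.setdefault(labels[w], set()).add(w)
        refined.extend(frozenset(g) for g in groups.values())
    return refined

def exchange(part1, part2, event, prior, max_rounds=100):
    """Alternate announcements until neither partition is refined further."""
    p1 = [frozenset(c) for c in part1]
    p2 = [frozenset(c) for c in part2]
    for _ in range(max_rounds):
        new2 = refine(p2, announce(p1, event, prior))    # processor 2 hears processor 1
        new1 = refine(p1, announce(new2, event, prior))  # processor 1 hears processor 2
        if len(new1) == len(p1) and len(new2) == len(p2):
            break                                        # nothing new was learned
        p1, p2 = new1, new2
    return announce(p1, event, prior), announce(p2, event, prior)

# Hypothetical example: nine equally likely states, event {3, 4}.
prior = {w: Fraction(1, 9) for w in range(1, 10)}
q1, q2 = exchange([{1, 2, 3}, {4, 5, 6}, {7, 8, 9}],
                  [{1, 2, 3, 4}, {5, 6, 7, 8}, {9}],
                  event={3, 4}, prior=prior)
```

In this example the two processors initially disagree at several states (for instance, 1/3 versus 1/2 at state 1), and the returned posteriors `q1` and `q2` coincide at every state once the exchange stops.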

The above results hold even when each processor obtains additional raw
observations during the adjustment process and when the history of communications is
itself random. Similar results were also proved for a detection problem [Borkar
and Varaiya, 1982].

A related, and much more general, situation is studied in this chapter; we
assume that the processors are not just interested in obtaining an optimal estimate or
a likelihood ratio, but that their objective is to try to minimize some common cost
function, given the available information. (Clearly, if each processor has a different
cost function, no agreement is possible even if all processors have identical
information.) In this setting, we assume that processors communicate to each other tentative
decisions (which initially will be different). That is, at any time, a processor
computes an optimal decision given the information it possesses and communicates it
to other processors. Whenever a processor receives such a message from another
processor, its information essentially increases and it will, in general, update its
own tentative decision, and so on. We prove that the qualitative results obtained
in [Geanakoplos and Polemarchakis, 1978; Borkar and Varaiya, 1982] for the estimation
problem (convergence and asymptotic agreement) are also valid for the decision
making problem for several, quite general, choices of the structure of the cost
function. However, tentative decisions do not form a martingale sequence and a
substantially different mathematical approach is required for the proofs. We point
out that estimation problems are a special case of the decision problems studied in
this Chapter, being equivalent to the minimization of the mean square error.

A  drawback  of  the  above  setting  is  that  each  processor  is  assumed  to  have  an 
infinite  memory.  We  have  implicitly  assumed  that  the  knowledge  of  a  processor  can 
only  increase  with  time  and,  therefore,  it  has  to  remember  the  entire  sequence  of 
messages  it  has  received  in  the  past.  There  is  also  the  implicit  assumption  that  if 
a  processor  receives  additional  raw  data  from  the  environment,  while  the  communication 
process  is  going  on,  these  data  are  remembered  forever.  These  assumptions  are 
undesirable,  especially  if  the  processors  are  supposed  to  model  humans,  because 
limited  memory  is  a  fundamental  component  of  the  bounded  rationality  behavior  of 
human decision makers [Simon, 1980]. We will therefore relax the infinite memory
assumption and allow the processors to forget any portion of their past knowledge.
We only constrain them to remember their most recent decision and the most recent
message (tentative decision) coming from another processor. (For a particular class
of communication protocols, we even allow them to forget their most recent decision.)
We then obtain convergence results similar to those obtained for the unbounded memory
model, although in a slightly weaker sense.

A particular problem of interest is one in which all random variables are
jointly Gaussian and the cost is a quadratic function of an unknown state of the
world and the decision. It was demonstrated in [Borkar and Varaiya, 1982] that
the common limit to which decisions converge (for the estimation problem) is
actually the centralized estimate, i.e. the estimate that would be obtained if all
processors were to communicate their detailed observations. We prove (Section 4.4)
that the same is true in the presence of memory limitations, provided that each
processor never forgets its own raw observations. (That is, it may only forget past
tentative decisions sent to it by other processors.) We indicate that, for linear
quadratic Gaussian (LQG) problems, our scheme is essentially a decomposition algorithm
for solving static linear estimation problems. As we point out in Section 4.4, this
scheme has certain appealing features: there is significant parallelism in the
computations which matches nicely with the assumed distribution of the data; also, in
the course of the algorithm, acceptable estimates are obtained much earlier than the
time that would be needed to compute the optimal estimate by centralizing the
information. These tentative estimates can be very useful whenever there are strict time
limits within which certain decisions have to be made.
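
The decomposition idea can be illustrated on a scalar static problem. The sketch below is an illustration under stated assumptions, not the algorithm analyzed in Section 4.4: x is a zero-mean scalar, processor i observes y^i = x + z^i with independent zero-mean Gaussian noises, processors update one at a time around a ring, and each remembers only its own observation and the latest message. Every random variable is represented by its coefficients on the basis (x, z^1, ..., z^N); all function names are hypothetical.

```python
def cov(a, b, variances):
    """Covariance of two linear combinations of independent zero-mean basis variables."""
    return sum(ai * bi * s for ai, bi, s in zip(a, b, variances))

def condition_on(target, known, variances):
    """Coefficients of E[target | known] for zero-mean jointly Gaussian variables.

    `known` holds one or two linearly independent coefficient vectors."""
    if len(known) == 1:
        (p,) = known
        w = cov(target, p, variances) / cov(p, p, variances)
        return [w * pi for pi in p]
    p, q = known
    spp, sqq, spq = cov(p, p, variances), cov(q, q, variances), cov(p, q, variances)
    tp, tq = cov(target, p, variances), cov(target, q, variances)
    det = spp * sqq - spq * spq
    wp = (tp * sqq - tq * spq) / det   # weight on the own observation
    wq = (tq * spp - tp * spq) / det   # weight on the received message
    return [wp * pi + wq * qi for pi, qi in zip(p, q)]

def ring_estimates(var_x, noise_vars, sweeps):
    """Tentative estimates produced by successive updates around the ring."""
    n_proc = len(noise_vars)
    variances = [var_x] + list(noise_vars)
    x = [1.0] + [0.0] * n_proc
    obs = [[1.0] + [1.0 if j == i else 0.0 for j in range(n_proc)]
           for i in range(n_proc)]     # y^i = x + z^i
    message, history = None, []
    for step in range(sweeps * n_proc):
        i = step % n_proc
        known = [obs[i]] if message is None else [obs[i], message]
        message = condition_on(x, known, variances)
        history.append(message)
    return history

def centralized_estimate(var_x, noise_vars):
    """Coefficients of E[x | y^1, ..., y^N], computed from all observations at once."""
    precisions = [1.0 / v for v in noise_vars]
    total = 1.0 / var_x + sum(precisions)
    weights = [p / total for p in precisions]
    return [sum(weights)] + weights

history = ring_estimates(1.0, [1.0, 2.0, 0.5], sweeps=2)
central = centralized_estimate(1.0, [1.0, 2.0, 0.5])
```

In this particular model each message is a sufficient statistic for the observations it summarizes, so the centralized estimate is already reached at the end of the first sweep; the intermediate entries of `history` are the tentative estimates available before that.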

We also consider (Section 4.5) a slightly different scheme in which each
processor transmits its tentative decision to a coordinator. The latter evaluates a
weighted average of the tentative decisions it has received and sends it back to all
processors. We show that our results remain valid for this scheme as well and suggest
an economic interpretation in which the coordinator can be viewed as some sort of
market mechanism. We also show that making optimal tentative decisions corresponds
to Nash strategies for a certain sequential game.

A weak point of the model is that each processor not only has the same prior
information and knows the statistics of the other processors' observations, but also
has the same model of the probabilistic mechanism that generates inter-processor
communications. In particular, if this is a deterministic mechanism, a processor
must know the precise history of communications between any pair of other processors,
a strong requirement. If it is a stochastic mechanism, then there are two
possibilities: either the history of communications becomes commonly known on-line (at the
expense of additional communications) or each processor will have to make
probabilistic inferences about the communications between all other processors. These
weaknesses disappear, however, if every tentative decision is broadcast simultaneously
to all other processors, at each stage. In that case the history of communications
is simple, commonly known and easy to remember. (This will be the case, for example,
if a set of experts with the same objective hold a teleconference and take turns suggesting
what they believe to be the optimal decision.)

In Section 4.2 we define the model and the scheme to be studied, as well as
a few special cases of particular interest. Section 4.3 contains the main results
of this Chapter. Section 4.4 specializes to linear quadratic Gaussian (static
linear estimation) problems. Section 4.5 considers the scheme involving a coordinator
to which we referred above. Section 4.6 discusses the case of processors with
different models and Section 4.7 states the main conclusions of this Chapter.

Most  of  the  results  in  this  chapter  appear  in  [Tsitsiklis  and  Athans,  1984]. 

4.2  MODEL  FORMULATION 

In this Section we present a mathematical formulation of the model informally
described in Section 4.1. We start with the general assumptions and later proceed
to the development of alternative specialized models to be considered (e.g. memory
limitations, particular forms of the cost function etc.). As far as the description
of the sequence of communications and updates goes, we basically adopt the model of
Borkar and Varaiya [1982] except that time is considered to be discrete. As in
[Borkar and Varaiya, 1982], events are timed with respect to a common, absolute clock.
As far as notation is concerned, we will use subscripts to denote time and superscripts
to denote processors.

We assume that we are given a set {1,...,N} of N processors, an underlying
probability space $(\Omega,\mathcal{F},P)$ and a real valued cost function
$c:\Omega\times U\to R$, where U is
the set of admissible values of the decision variable. It will be useful in the
sequel to distinguish between elements of U and U-valued random variables. The
letter v will be used to denote elements of U whereas u, w will be used to denote
U-valued random variables (measurable functions from $\Omega$ to U).

Assumption 4.2.1: Either

(4.2.1.1): U is a finite set, or

(4.2.1.2): $U = R^n$, for some n.

Assumption 4.2.2: The cost function c is nonnegative and jointly measurable in $(\omega,v)$.
Moreover, $E[c(v)]<\infty$, $\forall v\in U$. When Assumption (4.2.1.2) holds, we assume that there
exists a positive and measurable function $A:\Omega\to R$ such that

$$A(\omega)\,\|v_1-v_2\|^2 \le \tfrac{1}{2}\,[c(\omega,v_1)+c(\omega,v_2)] - c\Big(\omega,\frac{v_1+v_2}{2}\Big), \quad \forall\omega\in\Omega,\ \forall v_1,v_2\in U. \qquad (4.2.1)$$

(Remark: If we fix $v_1$, $v_2$, $v_1\neq v_2$ and take expectations of both sides of (4.2.1),
it follows that A is integrable.)


Inequality (4.2.1) implies that c is a strictly convex function of v and
strict convexity holds in a uniform way, for any fixed $\omega\in\Omega$. It also follows that
$c(\omega,v)$ is continuous in v for any $\omega\in\Omega$. This assumption is satisfied, in particular, if
c is twice continuously differentiable in v and its Hessian is positive definite,
uniformly in v, for any fixed $\omega\in\Omega$.

We may use the function A, defined in Assumption 4.2.2, to define a new measure
$\mu$ on $(\Omega,\mathcal{F})$ by

$$\mu(B) = \int_B A(\omega)\,dP(\omega), \quad B\in\mathcal{F}. \qquad (4.2.2)$$


This  measure  will  be  used  in  Section  4.3. 

We now consider the generic situation facing processor i at some time n. Let
$\mathcal{F}_n^i\subset\mathcal{F}$ be a $\sigma$-field of events describing the information possessed by processor i
at time n. Because of Assumption 4.2.2, the conditional expectation $E[c(v)\mid\mathcal{F}_n^i]$
exists (is finite), is $\mathcal{F}_n^i$-measurable and is uniquely determined up to a set of measure
zero, for any fixed $v\in U$. Processor i then computes a tentative decision $u_n^i$ that
minimizes $E[c(v)\mid\mathcal{F}_n^i]$. The following Lemma states that $u_n^i$ is well-defined and
$\mathcal{F}_n^i$-measurable.

Lemma 4.2.1: Under Assumptions 4.2.1.2, 4.2.2, there exists an $\mathcal{F}_n^i$-measurable random
variable $u_n^i$, which is unique up to a set of measure zero, such that

$$E[c(u_n^i)\mid\mathcal{F}_n^i] \le E[c(w)\mid\mathcal{F}_n^i], \quad \text{almost surely}, \qquad (4.2.3)$$

for any U-valued, $\mathcal{F}_n^i$-measurable random variable w. The same results are true
(except for uniqueness) under Assumptions 4.2.1.1, 4.2.2.


Proof: Let, for notational convenience, $g(\omega,v) = E[c(v)\mid\mathcal{F}_n^i](\omega)$. Under Assumptions
4.2.1.2 and 4.2.2, $g(\omega,v)$ can be chosen so that it is jointly measurable, strictly
convex in v and converges to infinity as v converges to infinity, for any $\omega\in\Omega$.
Hence, $\forall\omega\in\Omega$, the infimum of $g(\omega,v)$ is attained by some $u_n^i(\omega)$ which is unique,
because of strict convexity.

Let $Q=\{q_k\}_{k=1}^{\infty}$ be a countable dense subset of U. Then, by continuity of g,
$\inf_{v\in U} g(\omega,v) = \inf_{v\in Q} g(\omega,v)$. Moreover, $\inf_{v\in Q} g(\omega,v)$ is $\mathcal{F}_n^i$-measurable. Let
$\phi_m(\omega) = q_k$, where k is the smallest index such that

$$g(\omega,q_k) \le \inf_{v\in U} g(\omega,v) + \frac{1}{m}.$$

Then $\phi_m$ is $\mathcal{F}_n^i$-measurable and converges, for each $\omega$, to $u_n^i$. Hence, $u_n^i$ is $\mathcal{F}_n^i$-measurable.
Inequality (4.2.3) now follows from the definition of $u_n^i$ and uniqueness is a
consequence of strict convexity. The measurability of $u_n^i$ under Assumption 4.2.1.1
is trivial.

We continue with a description of the process of communications between
processors. When, at time n, processor i computes its tentative optimal decision $u_n^i$, it
may communicate its realized value (say $v_n^i$) to any other processor. (If $v_n^i$ is not
unique, a particular minimizing $v_n^i$ is selected according to some commonly known rule.)
Whether, when and to which processors $v_n^i$ is to be sent is a random event whose
statistics are described by $(\Omega,\mathcal{F},P)$. In particular, it may depend on the data possessed by
processor i at time n. So, we implicitly allow the processors to influence the process of
communications, although we do not require this influence to be optimal in any sense.
This allows the possibility of signalling additional information, beyond that contained
in $v_n^i$, by appropriately choosing when and to which processors to communicate. We allow
the communication delays to be random but finite. We also assume that when a
processor receives a message it knows the identity of the processor who sent it.

We now impose conditions on the number of messages to be communicated in the
long run; these conditions are necessary for agreement to be guaranteed. Namely,
we require that there is an indirect communication link from any processor to any
other processor which is used an infinite number of times. This can be made precise
as follows:

Let A(i) be the set of all processors that send an infinite number of messages
to processor i, with probability 1. Then, we make the following assumption:

Assumption 4.2.3: There is a sequence $m_1,\ldots,m_{k+1}=m_1$ of not necessarily distinct
processors such that $m_i\in A(m_{i+1})$, $i=1,2,\ldots,k$. Each processor appears at least once
in this sequence.

The main consequence of Assumption 4.2.3, which will be repeatedly used, is the
following: If $\{h^i: i=1,\ldots,N\}$ is a set of numbers such that $h^i\le h^j$, $\forall j\in A(i)$, $\forall i$, then
$h^i=h^j$, $\forall i,j$.
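
Assumption 4.2.3 can be checked mechanically once the sets A(i) are given. The following is a minimal sketch, assuming that the assumption is read as requiring a closed chain of processors (each one sending infinitely often to the next) that passes through every processor; for two or more processors this is the same as strong connectivity of the directed graph with an edge from j to i whenever j belongs to A(i). The function names and the dictionary representation are hypothetical.

```python
def reachable(start, successors):
    """Set of nodes reachable from `start` by following `successors`."""
    seen, stack = {start}, [start]
    while stack:
        node = stack.pop()
        for nxt in successors[node]:
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen

def has_closed_chain(senders):
    """senders[i] = A(i); every processor must appear as a key (assumed non-empty)."""
    nodes = set(senders)
    forward = {j: set() for j in nodes}            # edge j -> i whenever j is in A(i)
    for i, a_i in senders.items():
        for j in a_i:
            forward[j].add(i)
    backward = {i: set(a_i) for i, a_i in senders.items()}
    root = next(iter(nodes))
    # Strongly connected: every node is reachable from root, and root from every node.
    return reachable(root, forward) == nodes and reachable(root, backward) == nodes

ring = {1: {3}, 2: {1}, 3: {2}}             # 1 -> 2 -> 3 -> 1
one_way_star = {1: set(), 2: {1}, 3: {1}}   # processor 1 never hears from anyone
```

Here `has_closed_chain(ring)` holds, whereas `has_closed_chain(one_way_star)` does not: in the second configuration processor 1 receives no messages, so nothing forces its decision to agree with those of the others.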


We continue with a more detailed specification of the operation of the processors.
We introduce assumptions on the knowledge $\mathcal{F}_n^i$ which are directly related to the
properties of the memory of processor i. A processor may receive (at any time)
observations on the state of the world or receive tentative decisions (messages) of other
processors. The knowledge of a processor at some time will be a subset (depending on
the properties of its memory) of the total information it has received up to that
time. We consider four alternative models of memory, formalized with the four
assumptions that follow.

Let $w_n^i$ be any message received by processor i at time n. Our most general
assumption requires that $w_n^i$ and $u_{n-1}^i$ are remembered at time n:

Assumption 4.2.4: (Imperfect Memory) For all n, the $\sigma$-field $\mathcal{F}_n^i$ is such that
$u_{n-1}^i$ and $w_n^i$ are $\mathcal{F}_n^i$-measurable.

Assumption 4.2.4 can be further weakened if some restrictions are imposed on
the communications protocol:

Assumption 4.2.5: (Imperfect Memory) For each n there exists a set I(n) of
processors such that:

a) $u_{n-1}^i$ is $\mathcal{F}_n^j$-measurable, $\forall i\in I(n-1)$, $\forall j\in I(n)$.

b) $\mathcal{F}_n^i = \mathcal{F}_{n-1}^i$, $\forall i$ not in I(n).

Intuitively, I(n) is the set of processors that update their decision at time
n. Assumption 4.2.5 is satisfied by the following two common communication protocols,
provided that processor i may obtain additional observations only at times such that
$i\in I(n)$:

Ring Protocol: I(n) = {k}, where k is the unique integer such that $1\le k\le N$ and
k+mN=n, for some integer m. Here, exactly one processor updates at any time instant
and communicates its tentative decision to the next processor and so on.

Star Protocol: I(n) = {1,...,N-1}, if n is odd; I(n) = {N}, if n is even. Here all
processors but the last one update simultaneously and communicate to the last processor,
who updates and communicates to all other processors and so on.
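
The two update schedules can be written down directly. The following is a small sketch of the sets I(n), with hypothetical function names; processors are numbered 1 to N and time n starts at 1, and the ring index is taken in the range 1 to N.

```python
def ring_updaters(n, num_processors):
    """Ring protocol: I(n) = {k} with 1 <= k <= N and k + m*N = n for some integer m."""
    k = (n - 1) % num_processors + 1
    return {k}

def star_updaters(n, num_processors):
    """Star protocol: processors 1..N-1 update at odd times, processor N at even times."""
    if n % 2 == 1:
        return set(range(1, num_processors))
    return {num_processors}

# Who updates during the first six time steps with N = 3 processors.
ring_schedule = [ring_updaters(n, 3) for n in range(1, 7)]
star_schedule = [star_updaters(n, 3) for n in range(1, 7)]
```

With three processors the ring schedule cycles through processors 1, 2, 3 in turn, while the star schedule alternates between the set {1, 2} and the last processor.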


Assumption 4.2.6: (Own Data Remembered) Let $\mathcal{G}_n^i$ be the subfield of $\mathcal{F}$ describing all
information that has been observed by processor i up to time n, except for the messages
of other processors. We assume that $\mathcal{G}_n^i\subset\mathcal{F}_n^i$.

With Assumption 4.2.6, we allow the processors to forget the messages they
received in the past, but they are restricted to remember all their past observations.
In this case the total information available to all processors is preserved.

Assumption 4.2.7: (Perfect Memory) We let Assumptions 4.2.4 and 4.2.6 hold and
assume that $\mathcal{F}_n^i\subset\mathcal{F}_{n+1}^i$, $\forall i,n$.

Whenever Assumption 4.2.7 holds, we will denote by $\mathcal{F}_\infty^i$ the smallest $\sigma$-field
containing $\mathcal{F}_n^i$, for all n.

We  now  define  a  few  special  cases  of  particular  interest: 

(i) Estimation Problem: We are given an $R^n$-valued random vector x on $(\Omega,\mathcal{F},P)$. The
objective is to minimize the mean square error. Hence, the cost function is
$c(v) = (x-v)^T(x-v)$, where T denotes transpose. It is easy to see that this is a
particular case of a strictly convex function covered by Assumption 4.2.2, with $A(\omega)$
being a constant.
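
The constant can be made explicit. The short derivation below is a sketch that reads the convexity condition of Assumption 4.2.2 as $A\,\|v_1-v_2\|^2 \le \tfrac12[c(v_1)+c(v_2)] - c((v_1+v_2)/2)$ and uses the shorthand $a = x-v_1$, $b = x-v_2$ (introduced here only for this computation).

```latex
\begin{align*}
\tfrac12\big[\|x-v_1\|^2+\|x-v_2\|^2\big]-\Big\|x-\tfrac{v_1+v_2}{2}\Big\|^2
  &= \tfrac12\big[\|a\|^2+\|b\|^2\big]-\tfrac14\|a+b\|^2 \\
  &= \tfrac14\big[\|a\|^2-2a^Tb+\|b\|^2\big] \\
  &= \tfrac14\|a-b\|^2 \;=\; \tfrac14\|v_1-v_2\|^2 .
\end{align*}
```

Under this reading, the quadratic cost satisfies the condition with equality for the constant choice $A(\omega)=1/4$.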

(ii) Static Linear Quadratic Gaussian Decision Problem (LQG): Let x be an unknown
random vector. Let the sequence of transmission and reception times be deterministic.
We assume that the random variables observed by the processors are zero mean and,
together with x, jointly normally distributed. We allow the total number of
observations to be infinite. Let $U=R^n$. The objective is to fix v so as to minimize the
expectation of the quadratic cost function $c(v) = v^TRv+x^TQv$, with R>0. It follows
that the optimal tentative decision of processor i at time n is
$u_n^i = G\,E[x\mid\mathcal{F}_n^i] = E[Gx\mid\mathcal{F}_n^i]$, where G is a precomputable matrix. If we redefine the
unknown vector x to be equal to Gx instead of x, we conclude that we may restrict attention
to estimation problems, without loss of generality.
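
The matrix G can be obtained from the first-order condition. The derivation below is a sketch under the additional assumption, not stated in the text, that R is symmetric; it also uses the fact that v is fixed when the conditional expectation is taken.

```latex
\begin{align*}
E[c(v)\mid\mathcal{F}_n^i] &= v^TRv + E[x\mid\mathcal{F}_n^i]^TQ\,v, \\
\nabla_v\,E[c(v)\mid\mathcal{F}_n^i] &= 2Rv + Q^TE[x\mid\mathcal{F}_n^i] = 0, \\
u_n^i &= -\tfrac12R^{-1}Q^TE[x\mid\mathcal{F}_n^i],
\qquad\text{so that}\qquad G = -\tfrac12R^{-1}Q^T .
\end{align*}
```

Since R>0, the conditional cost is strictly convex in v and the stationary point is its unique minimizer.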
(iii) Finite Probability Spaces: Here we let $\Omega$ be a finite set. Then, there
exist finitely many $\sigma$-fields of subsets of $\Omega$. Strict convexity implies that for
each $\sigma$-field $\mathcal{F}_0\subset\mathcal{F}$ and any $\omega\in\Omega$ of positive probability there exists a unique optimal
tentative decision. This implies in turn that tentative decisions take values in
some finite subset of U, with probability 1. We will therefore assume, without loss
of generality, that U is a finite set.

We conclude this section by presenting a simple example that illustrates our
scheme, under imperfect memory assumptions, where each processor forgets everything,
except for its last decision and the most recent message it has received.

Suppose that we only have two processors who communicate to each other their
tentative decisions at each instant of time. Let $x, z_n^1, z_n^2$, n=1,2,... be independent
random variables, with known probability distributions. Let $y_n^i = h^i(x,z_n^i)$ be the
observation of processor i at time n. Let c(x,v) be a cost function, satisfying
Assumption 4.2.2.

The tentative decisions are defined inductively as follows: Suppose that
$y_n^i$, $u_{n-1}^1$, $u_{n-1}^2$ are known by processor i at time n. It then computes, for each v,
the conditional cost $E[c(x,v)\mid y_n^i, u_{n-1}^1, u_{n-1}^2]$, which is equal to $g_n^i(y_n^i,u_{n-1}^1,u_{n-1}^2,v)$,
for some Borel function $g_n^i$. Finally, it chooses a minimizing v, which is a function
of $y_n^i$, $u_{n-1}^1$, $u_{n-1}^2$, and this is its tentative decision at time n. Hence, for
appropriate functions $f_n^1$, $f_n^2$, we have

$$u_n^i = f_n^i(y_n^i, u_{n-1}^1, u_{n-1}^2), \quad i=1,2.$$

If we now define $(\Omega,\mathcal{F},P)$ to be the product of the probability spaces on which $x$, $z_n^i$
are defined, and let $\mathcal{F}_n^i$ be the $\sigma$-field generated by $y_n^i$, $u_{n-1}^1$, $u_{n-1}^2$, Assumptions 4.2.3
and 4.2.4 are satisfied and the asymptotic properties of the above recursions can be
analyzed within our general framework.


4.3  CONVERGENCE  AND  AGREEMENT  RESULTS 



In  this  Section  we  state  and  discuss  our  main  results.  Assumptions  4.2.2 
and  4.2.3  will  be  assumed  throughout  the  rest  of  this  chapter  and  will  not  be 
explicitly  mentioned  in  the  statement  of  each  theorem.  Before  proceeding  to  our 
results,  we  prove  a  Lemma  to  be  used  later. 

Lemma 4.3.1: Let $\{u_n\}$, $\{w_n\}$ be two sequences of U-valued random variables such that

$$\lim_{n\to\infty} E\Big[c\Big(\frac{u_n+w_n}{2}\Big)\Big] = \lim_{n\to\infty} E[c(u_n)] = \lim_{n\to\infty} E[c(w_n)]. \qquad (4.3.1)$$

If Assumptions 4.2.1.2 and 4.2.2 hold, then $\lim_{n\to\infty}(u_n-w_n)=0$ in $L_2(\Omega,\mathcal{F},\mu)$ and in
probability.


Proof: By Assumptions 4.2.1.2, 4.2.2 and equation (4.3.1),

$$\lim_{n\to\infty} E\big[A(\omega)\,\|u_n-w_n\|^2\big] \le \lim_{n\to\infty}\Big\{E\Big[\frac{c(u_n)+c(w_n)}{2}\Big] - E\Big[c\Big(\frac{u_n+w_n}{2}\Big)\Big]\Big\} = 0, \qquad (4.3.2)$$

which shows that $(u_n-w_n)$ converges to zero in $L_2(\Omega,\mathcal{F},\mu)$. Therefore, it also
converges in measure with respect to $\mu$.

Recall that $A(\omega)>0$ and $\mu(B) = \int_B A(\omega)\,dP(\omega)$, $\forall B\in\mathcal{F}$. Therefore, $\mu(B)=0$ implies
$P(B)=0$ and P is absolutely continuous with respect to $\mu$. Let $B_n^\epsilon = \{\omega: \|u_n-w_n\|\ge\epsilon\}$.
Since $(u_n-w_n)$ converges to zero in measure $\mu$, for any $\epsilon>0$, we have $\lim_{n\to\infty}\mu(B_n^\epsilon)=0$
and, by absolute continuity, $\lim_{n\to\infty}P(B_n^\epsilon)=0$, which shows that we have convergence in
probability. □

Our first theorem holds under the least restrictive assumptions on memory:

* In Lemma 4.3.1, convergence in $L_2(\Omega,\mathcal{F},\mu)$ means convergence in the mean square with
respect to the measure $\mu$.


Theorem 4.3.1: We assume that transmissions and receptions are deterministic,
that communication delays are bounded and that the time between two consecutive
transmissions from processor j to processor i (with $j\in A(i)$) is bounded. Then,
under Assumption 4.2.1.2 (convex costs) and either Assumption 4.2.4 or 4.2.5
(imperfect memory):

a) $\lim_{n\to\infty}(u_{n+1}^i-u_n^i)=0$, in probability and in $L_2(\Omega,\mathcal{F},\mu)$.

b) $\lim_{n\to\infty}(u_n^i-u_n^j)=0$, $\forall i,j$, in probability and in $L_2(\Omega,\mathcal{F},\mu)$.

Proof: We start with the proof under Assumption 4.2.4. Since $u_n^i$ is $\mathcal{F}_{n+1}^i$-measurable,
we have (by the minimizing property of $u_{n+1}^i$) $E[c(u_{n+1}^i)]\le E[c(u_n^i)]$. Since c is
nonnegative, $E[c(u_n^i)]$ converges to some constant $g^i$. We also note that $(u_{n+1}^i+u_n^i)/2$ is
$\mathcal{F}_{n+1}^i$-measurable and by taking the limit in the relation

$$\tfrac{1}{2}E\big[c(u_{n+1}^i)+c(u_n^i)\big] \ge E\Big[c\Big(\frac{u_{n+1}^i+u_n^i}{2}\Big)\Big] \ge E[c(u_{n+1}^i)]$$

we obtain $\lim_{n\to\infty}E[c((u_{n+1}^i+u_n^i)/2)]=g^i$. Lemma 4.3.1 then yields the first part of
the theorem.

Let $j\in A(i)$. Then there exist sequences $\{m_k\}$ and $\{n_k\}$ of positive integers
such that $\lim_{k\to\infty}m_k = \lim_{k\to\infty}n_k=\infty$ and $m_k$, $n_k$ are the times of transmission and
reception, respectively, of the k-th message from processor j to processor i. Therefore,
$u_{m_k}^j$ is $\mathcal{F}_{n_k}^i$-measurable, for all k, and $E[c(u_{n_k}^i)]\le E[c(u_{m_k}^j)]$, which shows that $g^i\le g^j$.
Using Assumption 4.2.3, we conclude that $g^i=g^j$, $\forall i,j$.
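Moreover, $(u_{n_k}^i+u_{m_k}^j)/2$ is $\mathcal{F}_{n_k}^i$-measurable, so that, as in the first part of the proof,

$$\tfrac{1}{2}E\big[c(u_{n_k}^i)+c(u_{m_k}^j)\big] \ge E\Big[c\Big(\frac{u_{n_k}^i+u_{m_k}^j}{2}\Big)\Big] \ge E[c(u_{n_k}^i)]. \qquad (4.3.3)$$

Taking the limit in (4.3.3), using Lemma 4.3.1 and the boundedness assumptions on
the communications, we obtain the second half of the theorem.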

We now assume Assumption 4.2.5. Let i(n) be a sequence of processors such
that $i(n)\in I(n)$, $\forall n$. Then, $u_n^{i(n)}$ is $\mathcal{F}_{n+1}^{i(n+1)}$-measurable and $E[c(u_n^{i(n)})]$ is a
decreasing sequence. As in the first part of the proof, we conclude that
$u_{n+1}^{i(n+1)}-u_n^{i(n)}$ converges to zero in $L_2(\Omega,\mathcal{F},\mu)$ and in probability. It follows that
$u_{n+1}^i-u_n^i$ and $u_n^i-u_n^j$ converge to zero, for $j\in A(i)$. Using Assumption 4.2.3, $u_n^i-u_n^j$
converges to zero, for all i,j. □

Consider the following situation: At time zero, before any observations are
obtained, the sequence of transmissions and receptions is selected at random,
according to a statistical law which is independent of all observations to be obtained
in the future and of c(v), for any $v\in U$. In other words, communications do not carry
any information relevant to the decision problem, other than the content of the
message being communicated. Suppose that the sequence of communications that has
been selected becomes known to all processors. From that point on, the situation is
identical with that of deterministic communications. In fact, a moment's thought
will show that it is sufficient for the history of communications to become commonly
known as it occurs: processor i only needs to know, at time n, what communications
have occurred up to that time, so that it can interpret correctly the meaning of the
messages it is receiving.
We can formalize these ideas as follows: We are given a product probability
space $(\Omega\times\Omega^*, \mathcal{F}\times\mathcal{F}^*, P\times P^*)$ where $(\Omega,\mathcal{F},P)$ describes the decision problem and
where $(\Omega^*,\mathcal{F}^*,P^*)$ describes the communications process. We assume that for each
$\omega^*\in\Omega^*$, the resulting process of communications satisfies the assumptions of Theorem
4.3.1. Then, note that for each $\omega^*\in\Omega^*$ we obtain a distributed decision problem on
$(\Omega,\mathcal{F},P)$ with deterministic communications. In that case:

Theorem 4.3.2: Under Assumption 4.2.1.2, either Assumption 4.2.4 or 4.2.5, and independent,
commonly known communications (as described above),
$\lim_{n\to\infty}(u_{n+1}^i-u_n^i) = \lim_{n\to\infty}(u_n^i-u_n^j)=0$, in probability with respect to $P\times P^*$.


Proof: Theorem 4.3.1 and the discussion preceding the statement of Theorem 4.3.2
show that $\lim_{n\to\infty}(u_{n+1}^i-u_n^i)=0$, in probability with respect to P, for all $\omega^*\in\Omega^*$.
Let $\chi_n(\omega,\omega^*)$ be the characteristic function of the set $\{(\omega,\omega^*): \|u_{n+1}^i-u_n^i\|<\epsilon\}$.
Then

$$\lim_{n\to\infty}\int\chi_n\,d(P\times P^*) = \lim_{n\to\infty}\int\!\!\int\chi_n\,dP\,dP^* = \int\lim_{n\to\infty}\int\chi_n\,dP\,dP^* = 1.$$

(The first equality holds by the Fubini theorem; the second by the dominated
convergence theorem; the third by convergence of $(u_{n+1}^i-u_n^i)$ to zero, with respect to the
probability measure P.) This shows that $u_{n+1}^i-u_n^i$ converges to zero in probability
with respect to $P\times P^*$. Similar steps show that $u_n^i-u_n^j$ also converges to zero, in
probability. □


Strictly speaking, Theorems 4.3.1 and 4.3.2 do not guarantee convergence of the
decisions of each agent. Suppose, however, that the agents operate under the
following rule: Fix some small $\gamma>0$. Let the sequence of communications and updates
of tentative decisions take place until $|u^i_n-u^j_n|<\gamma$, $\forall i,j$ (small disagreement)
and $|u^i_{n+1}-u^i_n|<\gamma$, $\forall i$ (small foreseeable changes in tentative decisions).
Then, we obtain:

Corollary 4.3.1: With the above rule and the assumptions of Theorems 4.3.1 or
4.3.2, the process terminates in finite time, with probability 1, for any $\gamma>0$.

Proof: By Theorem 4.3.1, the U×U-valued sequence of random variables
$(u^i_n-u^j_n,\ u^i_{n+1}-u^i_n)$ converges to (0,0) in probability. It therefore contains a
subsequence converging to (0,0), almost surely. Therefore, $\forall\gamma>0$ and for almost all
$\omega\in\Omega$, $\exists n_0$ such that $|u^i_{n_0}-u^j_{n_0}|<\gamma$, $|u^i_{n_0+1}-u^i_{n_0}|<\gamma$ and the
termination condition is eventually satisfied with probability one. ■
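The stopping rule above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `should_terminate` and the array layout (one row per processor) are hypothetical and do not come from the report, and the Euclidean norm is assumed for the distance between decisions.

```python
import numpy as np

def should_terminate(current, upcoming, gamma):
    """Check the stopping rule for one stage (hypothetical helper).

    current, upcoming: arrays of shape (N, d) holding the tentative decisions
    of the N processors at stage n and at stage n+1.
    """
    current = np.asarray(current, dtype=float)
    upcoming = np.asarray(upcoming, dtype=float)
    # Largest pairwise disagreement |u^i_n - u^j_n| over all i, j.
    diffs = current[:, None, :] - current[None, :, :]
    max_disagreement = np.linalg.norm(diffs, axis=-1).max()
    # Largest change |u^i_{n+1} - u^i_n| over all processors i.
    max_change = np.linalg.norm(upcoming - current, axis=-1).max()
    return bool(max_disagreement < gamma and max_change < gamma)
```

The rule fires only when both quantities are below the same threshold, which is what the corollary needs: each of the two differences is small infinitely often along a common subsequence.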

When Ω and U are finite, convergence and agreement are obtained after finitely
many stages:

Theorem 4.3.3: If Ω and U are finite sets, if each processor communicates all the
values of v that minimize $E[c(v)\mid F^i_n]$ and if Assumption 4.2.4 holds, then there exists
some positive integer M such that

$$u^i_M = u^j_M,\ \forall i,j \quad\text{and}\quad u^i_{M+n} = u^i_M,\ \forall i,\ \forall n,\ \forall\omega\in\Omega.$$

Proof: Because of the finiteness of Ω, there exists a finite (non-random) time after
which communications (conditioned on past events) are deterministic. We may take that
time as the initial time and assume, without loss of generality, that all communications
are deterministic.


Let $u^i_n$ be the set of elements of U which are optimal, given $F^i_n$. Let $w^i_n$
denote an $F^i_n$-measurable random variable such that $w^i_n(\omega)\in u^i_n(\omega)$, $\forall\omega\in\Omega$.
(Note that $E[c(w^i_n)]$ is independent of how $w^i_n$ has been selected.) By finiteness of
Ω and U, there exist finitely many U-valued random variables and, since
$E[c(w^i_{n+1})]\le E[c(w^i_n)]$, we conclude that there exists some positive integer T and
some $g^i$ such that $E[c(w^i_n)] = g^i$, $\forall n\ge T$. For any $n\ge T$,
$E[c(w^i_n)] = E[c(w^i_{n+1})]$ and since $w^i_n$ is $F^i_{n+1}$-measurable, $w^i_n$ minimizes
$E[c(w)]$ over all $F^i_{n+1}$-measurable random variables. Hence,
$w^i_n(\omega)\in u^i_{n+1}(\omega)$, $\forall\omega\in\Omega$, which shows that
$u^i_n(\omega)\subset u^i_{n+1}(\omega)$, $\forall\omega\in\Omega$. Again, by finiteness of U and Ω, there
exists some positive integer M such that $u^i_{n+1}(\omega) = u^i_n(\omega)$, $\forall n\ge M$,
$\forall\omega\in\Omega$, $\forall i$.

If $j\in A(i)$, there exist $m,n>M$ such that $w^j_m$ is $F^i_n$-measurable and this shows
that $g^j\ge g^i$. By Assumption 4.2.3, we obtain $g^i = g^j$, $\forall i,j$. Therefore, $w^j_m$
minimizes $E[c(w)]$ over all $F^i_n$-measurable random variables and, therefore,
$w^j_m(\omega)\in u^i_n(\omega)$, or, $u^j_m(\omega)\subset u^i_n(\omega)$, $\forall\omega\in\Omega$. Recalling
Assumption 4.2.3, we obtain $u^j_m(\omega) = u^i_n(\omega)$, $\forall i,j$, $\forall m,n>M$, $\forall\omega$. ■

Strictly speaking, tentative decisions in the above theorem are not elements of
U but subsets of U. This is to compensate for the possibility of non-uniqueness of
optimal tentative decisions. The equalities appearing in Theorem 4.3.3 have to be
interpreted, therefore, as equalities of sets.

We now assume that the processors have perfect memory. We obtain results similar
to Theorems 4.3.1 and 4.3.2 under much more relaxed assumptions on the communications
process. Namely, we only need to assume the following:


Assumption 4.3.1: Let $M^{ij}_k$ be the k-th message sent by processor j to processor i.
We assume that when processor i receives $M^{ij}_k$, it knows that this is indeed the k-th
message sent to it by processor j.

Remark: This assumption is trivially satisfied if messages arrive in exactly the same
order as they are sent, with probability 1.


Lemma 4.3.2: Let T be a finite stopping time of an increasing family $\{F_n\}$ of σ-fields.
Let $u_n$, n=1,2,... be random variables that minimize $E[c(w)\mid F_n]$, almost surely, over
all $F_n$-measurable random variables w. Then, $u_T$ minimizes $E[c(w)]$ over all
$F_T$-measurable random variables w, where $u_T$ is defined to be equal to $u_n$ if T=n.

Proof: Let $\chi_n$ be the indicator function of the set $\{\omega : T(\omega)=n\}$. Since T is a
stopping time, $\chi_n$ is $F_n$-measurable. Note that $\chi_n c(u_n) = \chi_n c(u_T)$. Let w be an
$F_T$-measurable random variable and note that $\chi_n c(w) = \chi_n c(\chi_n w)$ and $\chi_n w$ is
$F_n$-measurable. Therefore,

$$E[\chi_n c(w)\mid F_n] = \chi_n E[c(\chi_n w)\mid F_n] \ge \chi_n E[c(u_n)\mid F_n] = E[\chi_n c(u_n)\mid F_n] = E[\chi_n c(u_T)\mid F_n].$$

Taking expectations, we obtain

$$E[\chi_n c(w)] \ge E[\chi_n c(u_T)]$$

and summing over all n (and using the monotone convergence theorem to interchange
summation and expectation) we obtain $E[c(w)]\ge E[c(u_T)]$. ■


Theorem 4.3.4: Under Assumptions 4.2.1.2 (convex costs), 4.2.7 (perfect memory)
and 4.3.1, there exists a U-valued random variable $u^*$ such that
$\lim_{n\to\infty}u^i_n=u^*$, $\forall i$, in probability and in $L_2(\Omega,F,\mu)$.


Proof: Since $u^i_n$ is $F^i_{n+1}$-measurable, we have $E[c(u^i_{n+1})]\le E[c(u^i_n)]$. Since c
is nonnegative, $E[c(u^i_n)]$ converges to some constant $g^i$. We also note that
$(u^i_n+u^i_{n+m})/2$ is $F^i_{n+m}$-measurable. Therefore,
$E[c((u^i_{n+m}+u^i_n)/2)]\ge E[c(u^i_{n+m})]\ge g^i$. Fix some $\epsilon>0$, and let n be large
enough so that $E[c(u^i_n)]\le g^i+\epsilon$. Then, using Assumption 4.2.2, we obtain
$E[A(\omega)\|u^i_{n+m}-u^i_n\|^2]\le\epsilon$, $\forall m\ge 0$. Therefore, $\{u^i_n\}$ is a Cauchy
sequence in $L_2(\Omega,F,\mu)$. By the completeness of $L_2$ spaces, there exists a U-valued
random variable $u^i_\infty$ such that $\lim_{n\to\infty}u^i_n=u^i_\infty$, in $L_2(\Omega,F,\mu)$ and,
therefore, in probability, with respect to P. (The proof of the last implication is
contained in the proof of Lemma 4.3.1.) Since

$$E\bigl[E[c(u^i_{n+1})\mid F^i_{n+1}]\mid F^i_n\bigr] \le E\bigl[E[c(u^i_n)\mid F^i_{n+1}]\mid F^i_n\bigr] = E[c(u^i_n)\mid F^i_n],$$

$\{E[c(u^i_n)\mid F^i_n],\ n=1,2,\dots\}$ is a supermartingale, with respect to $\{F^i_n\}$.
Moreover, since for any fixed $v\in U$, $E[c(u^i_n)\mid F^i_n]\le E[c(v)\mid F^i_n]$, it is a
uniformly integrable supermartingale [Meyer, 1966, Theorem T19, p.90].

Let $j\in A(i)$. Let $m_k$, $n_k$ be the times of transmission and reception, respectively,
of the k-th message from j to i. Because of Assumption 4.3.1, $m_k$ and $n_k$ are stopping
times of $\{F^j_n\}$, $\{F^i_n\}$, respectively, and since $j\in A(i)$, they are almost surely
finite stopping times, for all k. Moreover, $k\le m_k\le n_k$ and by the optional sampling
theorem [Meyer, 1966, Theorem T28, p.90],

$$E[c(u^i_k)] \ge E[c(u^i_{n_k})] \ge g^i$$

which shows that $\lim_{k\to\infty}E[c(u^i_{n_k})]=g^i$. Similarly,
$\lim_{k\to\infty}E[c(u^j_{m_k})]=g^j$.

Note that $u^j_{m_k}$ is $F^i_{n_k}$-measurable and, by Lemma 4.3.2,
$E[c(u^j_{m_k})]\ge E[c(u^i_{n_k})]$. Taking the limit, we obtain $g^j\ge g^i$, and by
Assumption 4.2.3, $g^i=g^j$, $\forall i,j$. We now take the limit of the inequalities

$$\tfrac{1}{2}E[c(u^i_{n_k})+c(u^j_{m_k})] \ge E\Bigl[c\Bigl(\frac{u^i_{n_k}+u^j_{m_k}}{2}\Bigr)\Bigr] \ge E[c(u^i_{n_k})]$$

to obtain $\lim_{k\to\infty}E[c((u^i_{n_k}+u^j_{m_k})/2)]=g^i$ and, by Lemma 4.3.1,
$\lim_{k\to\infty}(u^i_{n_k}-u^j_{m_k})=0$, in $L_2(\Omega,F,\mu)$ and in probability.

We also take the limit of the inequalities

$$\tfrac{1}{2}E[c(u^i_{n_k})+c(u^i_k)] \ge E\Bigl[c\Bigl(\frac{u^i_{n_k}+u^i_k}{2}\Bigr)\Bigr] \ge E[c(u^i_{n_k})]$$

to obtain $\lim_{k\to\infty}E[c((u^i_{n_k}+u^i_k)/2)]=g^i$ and, by Lemma 4.3.1,
$\lim_{k\to\infty}(u^i_{n_k}-u^i_k)=0$. Similarly, we obtain
$\lim_{k\to\infty}(u^j_{m_k}-u^j_k)=0$, which shows that $u^i_\infty=u^j_\infty$, almost surely. ■

For estimation problems ($u^i_n=E[x\mid F^i_n]$), Theorem 4.3.4 can be slightly strengthened
[Borkar and Varaiya, 1982, Theorem 2]:


Theorem  4.3.5:  For  estimation  problems,  under  the  assumption  of  Theorem  4.3.4, 
convergence  to  u*  takes  place  with  probability  1. 

We now consider the case where U is finite but (unlike Theorem 4.3.3) Ω is allowed
to be infinite. Several complications may arise, all of them due to the fact that
optimal decisions, given some information, are not guaranteed to be unique. We discuss
these issues briefly, in order to motivate the next theorem.

Suppose that $U=\{v_1,v_2\}$. It is conceivable that $E[c(v_1)\mid F^i_n]-E[c(v_2)\mid F^i_n]$ is
never zero and changes sign an infinite number of times, on a set of positive probability.
In that case, the decisions of processor i do not converge. Even worse, it is conceivable
that $E[c(v_1)\mid F^i_n]>E[c(v_2)\mid F^i_n]$ and $E[c(v_1)\mid F^j_n]<E[c(v_2)\mid F^j_n]$, for all n and
for all ω in a set of positive probability, in which case processors i and j disagree
forever. It is not hard to show that in both of the above cases
$E[c(v_1)\mid F^i_\infty]=E[c(v_2)\mid F^i_\infty]$, on a set of positive probability and this
non-uniqueness is the source of the pathology. The following theorem states that
convergence and agreement are still obtained, provided that we explicitly exclude the
possibility of non-uniqueness.

Theorem 4.3.6: Under Assumption 4.2.1.1 (finite U) and 4.2.7 (perfect memory) and if
the random variable $u^i_\infty$ that minimizes $E[c(w)]$ over all $F^i_\infty$-measurable random
variables is unique up to a set of measure zero, for all i, then
$\lim_{n\to\infty}u^i_n=u^i_\infty$, almost surely, and $u^i_\infty=u^j_\infty$, $\forall i,j$.

Proof: Fix some $v\in U$ and let $B=\{\omega : u^i_\infty(\omega)=v\}$. Then,
$E[c(v)\mid F^i_\infty]<E[c(v')\mid F^i_\infty]$, $\forall v'\ne v$, for almost all $\omega\in B$. By the
martingale convergence theorem [Meyer, 1966, Theorem T17, p.84], we conclude that for
almost all $\omega\in B$, there exists some $N(\omega)$ such that

$$E[c(v)\mid F^i_n] < E[c(v')\mid F^i_n], \quad \forall n\ge N(\omega).$$

Therefore, $\lim_{n\to\infty}u^i_n(\omega)=v$, for almost all $\omega\in B$ and, by considering the
other elements of U as well, $\lim_{n\to\infty}u^i_n=u^i_\infty$, almost surely.

If $j\in A(i)$, $u^j_\infty$ is $F^i_\infty$-measurable and $E[c(u^j_\infty)]\ge E[c(u^i_\infty)]$. By
Assumption 4.2.3, $E[c(u^i_\infty)]=E[c(u^j_\infty)]$, $\forall i,j$. Therefore, for $j\in A(i)$,
$u^j_\infty$ minimizes $E[c(w)]$ over all $F^i_\infty$-measurable random variables and by the
assumptions of the theorem, $u^i_\infty=u^j_\infty$, almost surely. Using Assumption 4.2.3 once
more, we obtain $u^i_\infty=u^j_\infty$, $\forall i,j$. ■


Although the preceding theorems guarantee that (under certain conditions) all
processors will agree, nothing has been said concerning the particular decision to
which all agents' decisions converge. In particular, some simple examples show that
the limit decision can be different from the optimal centralized solution (that is,
the solution to be obtained if all agents were to communicate all their information).
On the other hand, the centralized solution is reached for LQG problems, under the
perfect memory assumption [Borkar and Varaiya, 1982] and is also reached generically
for an estimation problem on a finite probability space [Geanakoplos and Polemarchakis,
1978]. This issue will be touched upon again in the next section.

Robustness  with  respect  to  Communication  Noise 

Schemes that centralize information by coding (e.g. by using the least significant
bits of the allowed messages [Aoki, 1973; Sandell and Athans, 1974]) tend to require
high bandwidth and are sensitive to noise in the communication channel. In our scheme,
although real numbers are being transmitted (infinite information content), the least
significant bits are not as essential. As a result, the qualitative convergence
properties of our scheme are retained even if communications of the tentative decisions
are assumed to be noisy. We provide a proof of this fact for estimation problems,
under the perfect memory assumption.

Suppose, as before, that at random times processor j communicates its optimal
tentative decision $u^j_n$. However, the message received by the other processors is
$\hat{u}^j_n=u^j_n+q^j_n$, where $q^j_n$ is a random vector representing the noise in the
channel. For simplicity, we assume that the noise vectors are independent, identically
distributed.


Theorem 4.3.7: Assume noisy communications (as described above). For estimation
problems, under Assumption 4.2.7 (perfect memory), there exists a U-valued random
variable $u^*$ such that $\lim_{n\to\infty}u^i_n=u^*$, $\forall i$, with probability 1.

Proof: Let x be the unknown vector to be estimated. Then $u^i_n=E[x\mid F^i_n]$ and converges
almost surely to $u^i_\infty=E[x\mid F^i_\infty]$. Moreover, $u^i_\infty$ minimizes $E[c(w)]$ over all
$F^i_\infty$-measurable random variables w. Let $j\in A(i)$ and let $m_k$ be the time of
transmission of the k-th message from j to i, and $n_k$ the time of its reception. Note
that $\frac{1}{M}\sum_{k=1}^{M}\hat{u}^j_{m_k}$ is $F^i_{n_M}$-measurable (and hence
$F^i_\infty$-measurable) and converges to $u^j_\infty$ almost surely. Therefore, $u^j_\infty$ is
$F^i_\infty$-measurable and $E[c(u^i_\infty)]\le E[c(u^j_\infty)]$ and, by Assumption 4.2.3,
$E[c(u^i_\infty)]=E[c(u^j_\infty)]$, $\forall i,j$. The minimizing property of $u^i_\infty$ and the
strict convexity of the quadratic cost function imply that $u^i_\infty=u^j_\infty$, almost
surely. ■
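The averaging step in the proof can be illustrated numerically. The following Python sketch is not from the report: the limit value, the form of the transmitted sequence and the unit-variance Gaussian noise are all hypothetical choices, and the noise is additionally assumed to have zero mean so that the running average of received messages recovers the sender's limit.

```python
import numpy as np

rng = np.random.default_rng(0)

u_limit = 2.0                      # hypothetical limit of processor j's estimates
M = 200_000                        # number of messages received so far
k = np.arange(1, M + 1)
u_sent = u_limit + 1.0 / k         # hypothetical transmitted estimates, converging to u_limit
noise = rng.normal(0.0, 1.0, M)    # i.i.d. channel noise (zero mean is an assumption)
u_received = u_sent + noise        # what processor i actually observes

# Running average (1/M) * sum_k u_received[k], as used in the proof.
running_average = np.cumsum(u_received) / k
final_error = abs(running_average[-1] - u_limit)
single_message_error = abs(u_received[-1] - u_limit)

print("error of the averaged messages:", final_error)
print("error of the last single message:", single_message_error)
```

A single received message stays noisy no matter how late it is sent, whereas the average over all received messages concentrates around the sender's limit; this is why perfect memory matters in the theorem.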

4.4  THE  LINEAR  QUADRATIC  GAUSSIAN  (LQG)  MODEL 

In this Section we specialize and strengthen some of our results by restricting
to the Linear Quadratic Gaussian model described in Section 4.2. (Recall that any
such problem is equivalent to an estimation problem; therefore,
$u^i_n=\hat{x}^i_n=E[x\mid F^i_n]$, for some random vector x.) Theorems 4.3.1, 4.3.4 and 4.3.5
are applicable. Moreover, the results of [Borkar and Varaiya, 1982] guarantee that, under
Assumption 4.2.7 (perfect memory), $u^i_n$ converges to the optimal centralized estimate,
given the information possessed by all processors. The following theorem states that
the same is true under the weaker Assumption 4.2.6.


Theorem 4.4.1: For the LQG problem, under the assumptions of Theorem 4.3.1 and
Assumption 4.2.6 (imperfect memory; own data remembered),
$\lim_{n\to\infty}\hat{x}^i_n=\hat{x}$, in the mean square, where $\hat{x}=E[x\mid F_\infty]$ and
$F_\infty$ is the smallest σ-field containing $F^i_n$, for all i,n.

•  oo  oo  n 


Proof: As is usual in linear least squares estimation, we use the setting of Hilbert
spaces of square integrable random variables. Let G be a Hilbert space of zero mean,
jointly Gaussian random variables on $(\Omega,F,P)$ such that each component of the unknown
vector x and the observations belong to G. The inner product in G is defined by
$\langle x,y\rangle=E[xy]$.

For each processor i, let $H^i$ denote the smallest closed subspace of G containing
all observations obtained by it. Let $H^i_n$ be the smallest closed subspace of G
containing all observations obtained by processor i up to time n. (Note that $H^i_n$ does
not contain all random variables known by processor i at time n, because it does not
need to contain any of the messages received by processor i.) Note also that, by
Assumption 4.2.6, $H^i_n\subset H^i_{n+1}\subset H^i$ and that $\sum_{k=1}^{N}H^k_n$ is the total
knowledge available to all processors at time n. The centralized estimate is the
projection of x on $\sum_{k=1}^{N}H^k$. We assume, without loss of generality, that x is a
scalar random variable, since each component can be separately estimated.

Let $\hat{x}^i_n=E[x\mid F^i_n]$ and $e^i_n=x-\hat{x}^i_n$ and, by the orthogonality of errors
and observations, we have $E[xy]=E[\hat{x}^i_n y]$, $\forall y\in H^i_n$. As in the proof of
Theorem 4.3.1 we have $\|e^i_{n+1}\|^2\le\|e^i_n\|^2$, $\forall n,i$, which implies that


(4.4.1) 


In particular, (4.4.1) implies that $\{\hat{x}^i_n\}$ is a norm-bounded sequence. By the weak
local sequential compactness of Hilbert spaces [Yosida, 1980, p.126], $\{\hat{x}^i_n\}$
contains a weakly convergent subsequence $\{\hat{x}^i_{n_\ell}\}$. In other words, there exists
an element $\hat{x}^i_\infty\in G$ such that $\langle y,\hat{x}^i_{n_\ell}\rangle$ converges to
$\langle y,\hat{x}^i_\infty\rangle$, $\forall y\in G$. Moreover,
$\hat{x}^i_n\in\sum_{k=1}^{N}H^k_n\subset\sum_{k=1}^{N}H^k$ and since closed subspaces are also
weakly closed [Yosida, 1980, Theorem 11, p.125], $\hat{x}^i_\infty\in\sum_{k=1}^{N}H^k$. Now let
$y\in H^i_n$. Then, $\langle y,\hat{x}^i_{n_\ell}\rangle=\langle y,x\rangle$, for all $\ell$ such
that $n_\ell\ge n$, which implies that $\langle y,\hat{x}^i_\infty\rangle=\langle y,x\rangle$.
Moreover, the sequence of subspaces $\{H^i_n\}$ generates $H^i$ which implies that
$\langle y,\hat{x}^i_\infty\rangle=\langle y,x\rangle$, $\forall y\in H^i$.

By Theorem 4.3.1, $(\hat{x}^i_n-\hat{x}^j_n)$ converges in the mean square (and therefore
weakly) to zero, which implies that $\hat{x}^i_\infty$ is also a weak limit point of
$\{\hat{x}^j_n\}$. The same argument as before shows that
$\langle y,\hat{x}^i_\infty\rangle=\langle y,x\rangle$, $\forall y\in H^j$, $\forall j$. Therefore,
$\langle y,\hat{x}^i_\infty\rangle=\langle y,x\rangle$, $\forall y\in\sum_{k=1}^{N}H^k$. But this is
exactly the condition that $\hat{x}^i_\infty$ is the centralized estimate, given the
observations of all processors. So, $\{\hat{x}^i_n\}$ has a unique weak limit point, which
is the same for all i and coincides with the centralized estimate.

It only remains to show that $\hat{x}^i_n$ converges to $\hat{x}$ strongly (in the mean
square). We know from [Yosida, 1980, p.120] that
$\|\hat{x}\|\le\liminf_{n\to\infty}\|\hat{x}^i_n\|$. On the other hand,

$$\|x\|^2-\|\hat{x}^i_n\|^2 = \|x-\hat{x}^i_n\|^2 \ge \|x-\hat{x}\|^2 = \|x\|^2-\|\hat{x}\|^2 \tag{4.4.2}$$

which shows that $\|\hat{x}\|\ge\limsup_{n\to\infty}\|\hat{x}^i_n\|$. Therefore,
$\|\hat{x}\|=\lim_{n\to\infty}\|\hat{x}^i_n\|$ and by Theorem 8, p.124 of [Yosida, 1980], we
conclude that $\lim_{n\to\infty}\|\hat{x}^i_n-\hat{x}\|^2=0$. ■

Note that Theorem 4.4.1 is much stronger than Theorem 4.3.1 which was proved
for the general case of imperfect memory. We have here convergence to a limit
solution which is also guaranteed to be the optimal centralized solution.

Our next result concerns the finite dimensional LQG problem in which the total
number of observations is finite. Namely, the smallest σ-field containing $F^i_n$ for
all i, n is generated by a finite number of (jointly Gaussian) random variables. In
that case, the centralized solution is going to be reached by all processors in a
finite number of stages, provided that all processors have perfect memory.

Theorem  4.4.2:  For  the  LQG  problem  with  finitely  many  observations  and  under  Assumption 
4.2.7  (perfect  memory),  the  centralized  solution  is  reached  by  all  processors  in  a 
finite  number  of  stages. 

Proof: We use again the Hilbert space formalism of the previous proof. Let $G^i_n$ be
the subspace of G describing the knowledge of processor i at time n (both its
observations and the messages it has received). By Assumption 4.2.7, we have
$G^i_n\subset G^i_{n+1}\subset G$. Since G can be chosen to be finite dimensional, there exists
some M (depending on the sequence of communications but deterministic) such that
$G^i_{M+n}=G^i_M$, $\forall n\ge 0$, $\forall i$. Equivalently, $\hat{x}^i_{M+n}=\hat{x}^i_M$,
$\forall n\ge 0$, $\forall i$, and by Theorem 4.4.1, $\hat{x}^i_M=\hat{x}^j_M=\hat{x}$. ■

Theorems 4.4.1 and 4.4.2 imply that the scheme considered in this Section may be
viewed as an algorithm for solving static linear estimation problems, an issue that we
discuss below.

The intuitive argument behind Theorem 4.4.2 is the following: once a processor has
received enough messages, it is able to infer exactly the values of the observations
of the other processors (or of some appropriate linear combinations of these
observations) and compute the centralized solution itself. So, communicating
optimal tentative decisions is in this case just another way of communicating all
information to all other processors. This scheme does not seem to have any particular
advantages (in terms of communication and computation requirements) over the scheme
where each processor communicates all its data directly.

However, the scheme of Theorem 4.4.1 (imperfect memory) seems to have some
appealing features, which we will now discuss. Suppose that a central processor
obtains an NM-dimensional vector of observations. Instead of inverting the NM×NM
covariance matrix (which would require $O(N^3M^3)$ operations) it splits its observations
into N M-dimensional vectors. Each M-dimensional vector is to be handled by a different
processor and suppose that the scheme of Theorem 4.4.1 is to be used. If the
ring protocol is to be used, there will be one inversion for each M-dimensional vector
of observations and for each round. This leads to $O(NM^3)$ operations per round. If,
for example, an acceptable estimate is obtained after $O(N)$ rounds, the final objective
will have been accomplished with a total of $O(N^2M^3)$ operations, which is one order of
magnitude less than the standard algorithm. It is not hard to show that if the noises
in observations belonging to different blocks of data are uncorrelated, agreement on
the optimal is obtained after two rounds only. Accordingly, if the noises in observations
in different blocks are weakly correlated, we expect this scheme to be faster
than the standard algorithm. This suggests that our scheme corresponds to a potentially
advantageous decomposition algorithm for static linear estimation problems. This
algorithm has some similarities with those suggested by Laub and Bailey [1978].
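The operation counts in this argument can be compared with a small Python sketch. It is a rough illustration only: the function names are hypothetical, constant factors are ignored, and an inversion of a k×k matrix is simply charged k³ operations.

```python
def centralized_ops(N, M):
    """Cost of inverting the full (N*M) x (N*M) covariance matrix."""
    return (N * M) ** 3

def decomposed_ops(N, M, rounds):
    """Cost of the ring scheme: one M x M inversion per processor per round."""
    return rounds * N * M ** 3

if __name__ == "__main__":
    N, M = 6, 3
    print("centralized:", centralized_ops(N, M))
    # With O(N) rounds the total is of order N^2 * M^3.
    print("decomposed, N rounds:", decomposed_ops(N, M, rounds=N))
```

For N = 6 blocks of M = 3 observations the centralized count is 18³ = 5832, while N rounds of the decomposed scheme cost 6·6·27 = 972 under this crude model; whether O(N) rounds actually suffice depends on the correlation between blocks.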


We now continue with an analysis of the convergence rate of the decomposition
algorithm, for two particular communication protocols. Let G be a Hilbert space of
zero-mean, Gaussian random variables. Let $H^1,\dots,H^N$ be closed subspaces and let H
be the smallest subspace of G containing $H^1,\dots,H^N$. Let $x\in G$ be a random variable to
be estimated.


Algorithm 1 (Star Protocol):

$$y^0_1 = 0, \tag{4.4.3}$$
$$y^0_n = E[x\mid y^1_n,\dots,y^N_n], \quad n>1, \tag{4.4.4}$$
$$y^i_{n+1} = E[x\mid H^i,y^0_n], \quad n\ge 1,\ i\in\{1,\dots,N\}. \tag{4.4.5}$$


Algorithm 2 (Ring Protocol):

$$y^1_1 = E[x\mid H^1], \tag{4.4.6}$$
$$y^{i+1}_n = E[x\mid H^{i+1},y^i_n], \quad n\ge 1,\ i\in\{1,\dots,N-1\}, \tag{4.4.7}$$
$$y^1_{n+1} = E[x\mid H^1,y^N_n], \quad n\ge 1. \tag{4.4.8}$$

By Theorem 4.4.1, $y^i_n$ converges to $E[x\mid H]$, in the mean square, for either of
the above algorithms. Theorem 4.4.3 below states that the mean square error converges
geometrically to the optimal mean square error, which is a stronger result. Similar
convergence rate results may be proved for a variety of other protocols as well, with
much more "chaotic" sequences of communications. However, the proofs for these two
examples are sufficient to illustrate the main idea of a more general proof.
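The ring protocol can be simulated for a scalar x observed through y_k = x + w_k. The sketch below is an illustration under stated assumptions, not the implementation used in this report: every estimate is represented by its coefficient vector a (so that the estimate is a·y), each processor projects x on the span of its own block and the single message it has just received, and the noise covariances are hypothetical choices.

```python
import numpy as np

def mse(a, C, c, var_x):
    # Mean square error E[(x - a.y)^2] of the linear estimate a.y.
    return var_x - 2.0 * (a @ c) + a @ C @ a

def ring_protocol(C, c, var_x, blocks, rounds):
    n = C.shape[0]
    a = np.zeros(n)                      # coefficients of the message in transit
    errors = []
    for _ in range(rounds):
        for block in blocks:
            # Columns of S: the processor's own observations, then the message.
            S = np.zeros((n, len(block) + 1))
            S[block, np.arange(len(block))] = 1.0
            S[:, -1] = a
            gram = S.T @ C @ S           # covariance of the available variables
            rhs = S.T @ c                # their cross-covariance with x
            # Least squares copes with the singular first step (zero message).
            beta = np.linalg.lstsq(gram, rhs, rcond=None)[0]
            a = S @ beta                 # new estimate, expressed in terms of y
            errors.append(mse(a, C, c, var_x))
    return a, np.array(errors)

n, var_x = 18, 5.0
ones = np.ones(n)
blocks = [list(range(3 * k, 3 * k + 3)) for k in range(6)]
c = var_x * ones                         # E[x y_k] = E[x^2]

def centralized_mse(C):
    return var_x - c @ np.linalg.solve(C, c)

# Case 1: noises uncorrelated across blocks (diagonal noise covariance).
C_diag = var_x * np.outer(ones, ones) + np.diag(np.linspace(0.5, 2.0, n))
_, err_diag = ring_protocol(C_diag, c, var_x, blocks, rounds=2)

# Case 2: correlated noises (hypothetical covariance).
rng = np.random.default_rng(0)
B = rng.standard_normal((n, n))
C_corr = var_x * np.outer(ones, ones) + B @ B.T / n + np.eye(n)
_, err_corr = ring_protocol(C_corr, c, var_x, blocks, rounds=20)

print("diagonal noise, gap after two rounds:", err_diag[-1] - centralized_mse(C_diag))
print("correlated noise, gap after 20 rounds:", err_corr[-1] - centralized_mse(C_corr))
```

Because each processor's subspace contains the message it received, the error cannot increase from one stage to the next, and it can never fall below the centralized error. With uncorrelated noises the weighted sums passed around the ring are sufficient statistics, so the centralized error is already reached within the first two rounds.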


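Theorem 4.4.3: Let H be finite dimensional. For any $z\in H$, $z\ne 0$, let

$$\alpha^i(z) = \frac{\|E[z\mid H^i]\|^2}{\|z\|^2} \tag{4.4.9}$$

and

$$\alpha = \inf_{z\in H,\ z\ne 0}\ \max_i\ \alpha^i(z). \tag{4.4.10}$$

Then, $0<\alpha\le 1$ and

a) For the star protocol

$$\|y^0_{n+1}-\hat{x}\|^2 \le (1-\alpha)\,\|y^0_n-\hat{x}\|^2 \tag{4.4.11}$$

b) For the ring protocol

$$\|y^N_{n+1}-\hat{x}\|^2 \le \Bigl(1-\frac{\alpha}{N}\Bigr)\|y^N_n-\hat{x}\|^2 \tag{4.4.12}$$

where $\hat{x}=E[x\mid H]$.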

Proof: The inequality $\alpha\le 1$ is trivial. Suppose, to derive a contradiction, that
$\alpha=0$. It is easy to see that the infimum, in the definition of α, is attained by
some $z\in H$, $z\ne 0$. For that particular z, we will have $\alpha^i(z)=0$, $\forall i$, which
implies that z is orthogonal to $H^i$, $\forall i$. But this contradicts the assumptions
$z\in H$, $z\ne 0$.

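a) Using (4.4.9) and simple orthogonality relations:

$$\|\hat{x}-E[\hat{x}\mid H^i,y^0_n]\|^2 = \|\hat{x}-y^0_n-E[\hat{x}-y^0_n\mid H^i,y^0_n]\|^2 = \|\hat{x}-y^0_n\|^2-\|E[\hat{x}-y^0_n\mid H^i,y^0_n]\|^2$$
$$\le \|\hat{x}-y^0_n\|^2-\|E[\hat{x}-y^0_n\mid H^i]\|^2 = \|\hat{x}-y^0_n\|^2\bigl(1-\alpha^i(\hat{x}-y^0_n)\bigr). \tag{4.4.13}$$

Therefore,

$$\min_i\|\hat{x}-E[\hat{x}\mid H^i,y^0_n]\|^2 \le \|\hat{x}-y^0_n\|^2\bigl(1-\max_i\alpha^i(\hat{x}-y^0_n)\bigr) \le (1-\alpha)\|\hat{x}-y^0_n\|^2. \tag{4.4.14}$$

Now note that

$$E[\hat{x}\mid H^i,y^0_n] = E\bigl[E[x\mid H]\mid H^i,y^0_n\bigr] = E[x\mid H^i,y^0_n] = y^i_{n+1} \tag{4.4.15}$$

and that

$$\|\hat{x}-y^0_{n+1}\|^2 \le \|\hat{x}-y^i_{n+1}\|^2, \quad \forall i. \tag{4.4.16}$$

Inequalities (4.4.14)-(4.4.16) yield (4.4.11).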


b) For notational convenience, we will interpret $y^{i-1}_{n+1}$, with i=1, as $y^N_n$.
Again, starting from (4.4.9), we have:

$$\alpha\|\hat{x}-y^N_n\|^2 \le \max_i\alpha^i(\hat{x}-y^N_n)\cdot\|\hat{x}-y^N_n\|^2 = \max_i\|E[\hat{x}-y^N_n\mid H^i]\|^2$$
$$\le \max_i\|E[\hat{x}-y^N_n\mid H^i,y^{i-1}_{n+1}]\|^2 \le \max_i\|y^i_{n+1}-y^N_n\|^2. \tag{4.4.17}$$

Let

$$\tilde{y}^i_{n+1} = y^i_{n+1}-y^{i-1}_{n+1} \tag{4.4.18}$$

and note that

$$y^{i-1}_{n+1} = E[x\mid y^{i-1}_{n+1}] = E\bigl[E[x\mid H^i,y^{i-1}_{n+1}]\mid y^{i-1}_{n+1}\bigr] = E[y^i_{n+1}\mid y^{i-1}_{n+1}]. \tag{4.4.19}$$

This implies that $\tilde{y}^i_{n+1}$ is orthogonal to $y^{i-1}_{n+1}$. Therefore,

$$\|y^i_{n+1}\|^2 = \|y^{i-1}_{n+1}\|^2+\|\tilde{y}^i_{n+1}\|^2, \tag{4.4.20}$$

$$\|y^N_{n+1}\|^2 = \|y^N_n\|^2+\sum_{k=1}^{N}\|\tilde{y}^k_{n+1}\|^2. \tag{4.4.21}$$

Using (4.4.20) and (4.4.21) in (4.4.17), we have

$$\alpha\|\hat{x}-y^N_n\|^2 \le \max_i\Bigl\|\sum_{k=1}^{i}\tilde{y}^k_{n+1}\Bigr\|^2 \le \max_i\ i\sum_{k=1}^{i}\|\tilde{y}^k_{n+1}\|^2 \le N\sum_{k=1}^{N}\|\tilde{y}^k_{n+1}\|^2$$
$$= N\bigl(\|\hat{x}-y^N_n\|^2-\|\hat{x}-y^N_{n+1}\|^2\bigr). \tag{4.4.22}$$

(The last equality is obtained from

$$\|\hat{x}\|^2 = \|\hat{x}-y^N_n\|^2+\|y^N_n\|^2 \tag{4.4.23}$$

which is a consequence of the orthogonality conditions for linear estimation.)
Rearranging (4.4.22), we obtain (4.4.12). ■
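For completeness, the rearrangement in the last step can be spelled out. The lines below are a sketch that takes the two ends of the chain (4.4.22) as given, writes α for the constant of (4.4.10) and $\hat{x}$ for $E[x\mid H]$, and then iterates the resulting one-round bound over m rounds.

```latex
\alpha\,\|\hat{x}-y^N_n\|^2 \le N\bigl(\|\hat{x}-y^N_n\|^2-\|\hat{x}-y^N_{n+1}\|^2\bigr)
\;\Longrightarrow\;
\|\hat{x}-y^N_{n+1}\|^2 \le \Bigl(1-\frac{\alpha}{N}\Bigr)\|\hat{x}-y^N_n\|^2
\;\Longrightarrow\;
\|\hat{x}-y^N_{n+m}\|^2 \le \Bigl(1-\frac{\alpha}{N}\Bigr)^{m}\|\hat{x}-y^N_n\|^2 .
```

The factor $1-\alpha/N$ lies in $[0,1)$ because $0<\alpha\le 1$, so the bound decays geometrically in the number of rounds.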

The main conclusion from Theorem 4.4.3 is that the estimation error decreases
geometrically, at a rate which is independent of the vector being estimated, but
which depends on the subspaces $H^1,\dots,H^N$. In particular, the constant α depends on
the angles between these subspaces. Accordingly, the convergence rate of the algorithm
may vary significantly when different decompositions of the space H are tried.
Apparently, α is larger when the subspaces $H^1,\dots,H^N$ are nearly orthogonal, in which
case α behaves like 1/N, as N changes. This suggests that more rounds of computations
are needed, in general, if a finer decomposition is used. On the other hand, finer
decompositions require fewer computations per round, because smaller covariance
matrices have to be inverted. Theorem 4.4.3 is not adequate to resolve this tradeoff,
primarily because inequalities (4.4.11), (4.4.12) are not particularly tight.

Theorem 4.4.3 suggests that the ring protocol can lead to a much slower algorithm
than the star protocol. Numerical experimentation, however, did not reveal any such
effect. This may be explained by observing that the proof of part (b) of the Theorem
utilizes bounds which are likely to be much less tight than those in part (a), most
notably, the inequality
$\bigl\|\sum_{k=1}^{i}\tilde{y}^k_{n+1}\bigr\|^2 \le i\sum_{k=1}^{i}\|\tilde{y}^k_{n+1}\|^2$.

We continue this Section with a discussion of the implementation of the decomposition
algorithm. First, it may be used by a single processor (centralized computation),
as an alternative to a standard algorithm for inverting the covariance matrix. Or, it
may be implemented in a decentralized manner, whereby each block of data corresponds
to a physically distinct processor. Note that, for any i, n, we have
$\hat{x}^i_n=a^i_n y$, where $\hat{x}^i_n$ is the estimate of the i-th processor at time n, y is
the vector of all available observations and $a^i_n$ is a row vector. When processor j
receives $\hat{x}^i_n$, it must also know $a^i_n$, in order to be able to extract information
from $\hat{x}^i_n$. There are two choices:

a) Processor j computes $a^i_n$, possibly off-line.

b) Processor i transmits $a^i_n$ to processor j.

Which of the two should be done depends on whether communications or computations
are more costly. Whether any one of the above two variations can be useful depends
on the particularities of the actual situation and its inherent communication and
computation limitations. More numerical experience is needed before a definite answer
can be given.

Numerical  Results 

Let x be an unknown scalar, zero mean, random variable to be estimated
($E[x^2]=5$). Let $y_i=x+w_i$ (i=1,...,18) be the observations. The noises are assumed
to be independent of x. We split the 18-dimensional
observation vector into blocks of data (corresponding to distinct processors) and used
the decomposition algorithm of Theorem 4.4.1. We employed the ring protocol and
assumed that at each stage a processor only knows its own observations and the most
recent message it received (Assumptions 4.2.5, 4.2.6).

Let $M_i$ be the number of observations assigned to processor i. We considered
two alternative decompositions: (i) N=2, $M_1=10$, $M_2=8$; (ii) N=6, $M_1=\dots=M_6=3$. We
first executed the algorithm using the noise covariance $\Sigma_w$ and, then, once more
using the covariance $\Sigma_w+I$.

The  results  are  presented  in  Figures  4.4.1,  4.4.2.  The  horizontal  axis  denotes 
stages  (each  stage  corresponds  to  an  update  by  some  processor)  and  the  vertical  axis 
indicates  the  associated  mean  square  error.  The  dotted  horizontal  line  indicates  the 
centralized  mean  square  error.  The  curves  D1  and  D2  corresponds  to  the  first  and 
second  decomposition,  respectively.  As  expected,  convergence  was  much  faster  when  the 
identity  was  added  to  the  initial  covariance;  moreover  the  first  decomposition 
converged  much  faster  than  the  second. 

To illustrate the merits of the decomposition algorithm we performed a rough
count of operations. We only took matrix inversions into account, assuming that the
inversion of an $M \times M$ matrix requires $M^3$ operations, which is accurate enough for our
purposes. With this counting scheme, the centralized algorithm required 5832 operations.
The points A, B in the graphs were reached after 4100, 1152 operations, respectively.

This  leads  to  the  following  conclusion:  While  the  first  decomposition  needs  very  few 
stages  to  converge,  it  does  not  have  any  particular  computational  advantages.  The 
second  decomposition,  however,  leads  to  an  estimate  close  to  the  optimal  with  much 
fewer  operations  than  the  centralized  algorithm. 
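The counting rule above is easy to reproduce. The following is a minimal sketch, assuming only the $M^3$ rule stated in the text; the helper name `inversion_cost` is ours, and the sketch reports the cost of inverting one matrix per block, not the number of such inversions needed to reach the points A and B.

```python
def inversion_cost(block_sizes):
    """Rough cost of inverting one M x M matrix per block, at M**3 operations each."""
    return sum(m ** 3 for m in block_sizes)


centralized = inversion_cost([18])         # a single 18 x 18 inversion
first_decomposition = inversion_cost([10, 8])   # decomposition (i): M_1=10, M_2=8
second_decomposition = inversion_cost([3] * 6)  # decomposition (ii): six blocks of size 3

print(centralized, first_decomposition, second_decomposition)
```

The centralized figure agrees with the 5832 operations quoted above; the other two values are the cost of one round in which every processor inverts a matrix of its own block size.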

It  is  fair  to  say,  however,  that  the  decomposition  algorithm  we  have  studied 
above  should  be  compared  to  other  decomposition  algorithms  for  solving  linear  equations, 
rather than algorithms requiring $O(N^3)$ operations.


4.5 A MODEL INVOLVING A COORDINATOR

In the previous sections we assumed that, for any pair of processors i, j, processor i is allowed to communicate to j. In this section we assume that a particular processor (denoted by the superscript o) has special status and acts like a coordinator. The scheme we envisage is the following: At each instance of time n, processor i evaluates $u_n^i$, which it communicates to the coordinator. The coordinator then combines $u_n^1$ to $u_n^N$ to produce a tentative decision $u_n^o$. (We assume that the coordinator has no data of its own.) It then transmits $u_n^o$ to all other processors, which accordingly update their decisions. Were the coordinator to combine $u_n^1$ to $u_n^N$ "optimally", the above scheme would reduce to the one of the previous sections and our past results would apply. We assume, however, that the coordinator simply sets

$$u_n^o = \sum_{i=1}^{N} a^i u_n^i,$$

where the coefficients $a^i$ are deterministic, positive and $\sum_{i=1}^{N} a^i = 1$. The implicit behavioral assumptions are: (i) The coordinator has no memory and (ii) it need not have a good knowledge of the problem. It only knows how much it can rely on each of the other processors; this is reflected by its choice of the coefficients $a^i$, which may be thought of as a "reliability index" for processor i in the eyes of the coordinator. We then obtain:


Theorem 4.5.1: The conclusions of Theorems 4.3.1, 4.3.4, 4.4.1 remain true (under their respective assumptions) with the scheme introduced in this section.


Proof: Since $u_n^o$ is $F_{n+1}^i$-measurable, it follows that

$$E[c(u_{n+1}^i)] \le E[c(u_n^o)] = E\Big[c\Big(\sum_{j=1}^{N} a^j u_n^j\Big)\Big] \le \sum_{j=1}^{N} a^j E[c(u_n^j)]. \qquad (4.5.1)$$

Also, since $u_n^i$ is $F_{n+1}^i$-measurable, $E[c(u_n^i)]$ decreases and converges to some $g^i$. Taking the limit in (4.5.1) we obtain $g^i \le \sum_j a^j g^j$ for every i; applying this to an index i that maximizes $g^i$, and using $a^j > 0$, $\sum_j a^j = 1$, we conclude that $g^i=g^j$, $\forall i,j$. From this point on, the proofs of Theorems 4.3.1, 4.3.4, 4.4.1 (with minor modifications) are valid and establish the desired conclusions. ■
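The coordinator's combining rule, and the last inequality in (4.5.1), can be illustrated numerically. This is only a sketch under stated assumptions: decisions are scalars, the cost is a hypothetical quadratic $c(u)=(u-1)^2$ (so that it is convex), and the names `coordinator_average` and `quadratic_cost` are ours. The check is made for a single realization; taking expectations preserves the inequality.

```python
def coordinator_average(decisions, weights):
    """Weighted average u^o = sum_i a^i u^i with positive weights summing to one."""
    if len(decisions) != len(weights):
        raise ValueError("one weight per processor is required")
    if any(a <= 0 for a in weights) or abs(sum(weights) - 1.0) > 1e-12:
        raise ValueError("weights must be positive and sum to one")
    return sum(a * u for a, u in zip(weights, decisions))


def quadratic_cost(u, target=1.0):
    """Hypothetical convex cost, used only for this illustration."""
    return (u - target) ** 2


decisions = [0.2, 1.5, 0.9]   # tentative decisions u^1, u^2, u^3 (made-up values)
weights = [0.5, 0.3, 0.2]     # reliability indices a^1, a^2, a^3

u_o = coordinator_average(decisions, weights)
cost_of_average = quadratic_cost(u_o)
average_of_costs = sum(a * quadratic_cost(u) for a, u in zip(weights, decisions))

# For a convex cost, the cost of the average never exceeds the average of the costs.
print(u_o, cost_of_average, average_of_costs)
```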

The above scheme can be viewed as a framework for cooperation, where the coordinator simply aids the processors; or, for LQG problems, as another decomposition algorithm. It can also be interpreted, however, from an entirely different point of view: Suppose that the processors are selfish and independent individuals, faced with identical situations, possessing different information and having to make repetitive decisions. They can certainly benefit by observing past decisions of the other processors but assume that this is not possible. They are able, however, to observe a weighted average $u_n^o$ of all decisions made in the last stage, which they take into account for their future actions. The motivation for such a model comes primarily from economics: Each processor is a buyer (or seller) in the same market and at each stage he obtains some aggregate information (e.g. the average price) on the transactions that were made in the previous stage. In this sense the "coordinator" simply represents a market mechanism. Our results state that, eventually, an "informational equilibrium" will be reached. Such an equilibrium has been studied by Radner [1979] in a different setting. However, there was no demonstration of an adjustment process that could lead to such an equilibrium. Our scheme provides a model of rational behavior which, if followed by each agent, leads to equilibrium.


Moreover,  within  such  a  context  (of  selfish  individuals  confronted  with 
identical  situations)  and  for  LQG  problems  with  perfect  memory,  optimal  tentative 
decisions  constitute  a  set  of  strategies  in  Nash  equilibrium  for  a  certain  game. 

(This is why optimal tentative decisions can be called "a model of rational behavior".)
Let  us  define  the  game  of  interest  more  precisely. 

Let $y_n^i$ be a vector of jointly Gaussian random variables that generate $G_n^i$, the $\sigma$-algebra of events known to processor i at time n if it had received no messages. At each stage, processor i selects a decision $u_n^i$ and incurs (but does not observe) a cost $\alpha^n c(u_n^i)$, where $0<\alpha<1$ and c is a quadratic cost function. Then $u_n^o = \sum_{i=1}^{N} a^i u_n^i$ is formed and communicated to all processors. The total cost to processor i is $J^i = \sum_{n=1}^{\infty} \alpha^n E[c(u_n^i)]$. A strategy $\Gamma^i$ for processor i is a sequence $\{\gamma_n^i,\ n=1,2,\ldots\}$ of measurable functions such that $u_n^i = \gamma_n^i(y_n^i, u_1^o, \ldots, u_{n-1}^o)$. A set $\{\Gamma^1,\ldots,\Gamma^N\}$ of strategies is said to be in Nash equilibrium if

$$J^i(\Gamma^1,\ldots,\Gamma^{i-1},\tilde\Gamma^i,\Gamma^{i+1},\ldots,\Gamma^N) \ge J^i(\Gamma^1,\ldots,\Gamma^i,\ldots,\Gamma^N)$$

for any strategy $\tilde\Gamma^i$, for any i. Let $\Gamma=\{\Gamma^i:\ i=1,\ldots,N\}$ denote the particular set of strategies where each processor at each stage plays its optimal tentative decision. (Note that these are linear strategies.) Then,


Theorem 4.5.2: $\Gamma$ is a set of strategies in Nash equilibrium.

Proof: Let $u_n^o$, $u_n^i$ be the coordinator messages and decisions, respectively, when all processors use $\Gamma^i$. Let $F_n^i$ be the smallest $\sigma$-algebra generated by $y_n^i$ and $u_1^o,\ldots,u_{n-1}^o$. Now suppose that a particular processor, say processor 1, uses a strategy $\tilde\Gamma^1$ different from $\Gamma^1$, while all other processors use $\Gamma^i$. Let $\tilde u_n^o$, $\tilde u_n^i$ be the coordinator messages and decisions resulting from the application of this new set of strategies.

We proceed by induction. Clearly, $\tilde u_1^1$ is $F_1^1$-measurable. Assume that $\tilde u_n^1$ and $\tilde u_1^o,\ldots,\tilde u_{n-1}^o$ are $F_n^1$-measurable. Then, because of the linearity of the $\Gamma^i$'s, $u_n^i - \tilde u_n^i$, ($i \ne 1$), is linear in $u_1^o - \tilde u_1^o, \ldots, u_{n-1}^o - \tilde u_{n-1}^o$, and hence $F_n^1$-measurable. Since $u_n^o$ is $F_{n+1}^1$-measurable and $F_n^1 \subset F_{n+1}^1$ (perfect memory), $\tilde u_n^o$ is $F_{n+1}^1$-measurable, and so is $\tilde u_{n+1}^1$. The induction shows that this is true for all n.

Therefore, by the minimizing property of $u_n^1$, $E[c(u_n^1)] \le E[c(\tilde u_n^1)]$, $\forall n$, and this completes the proof. ■


4.6  PROCESSORS  WITH  DIFFERENT  MODELS 

The results of the preceding sections rest on the crucial assumption that all
processors have identical models of the situation; that is, they all employ the same
probability measure in their calculations. In this section we discuss what could
happen if this assumption is violated. The motivation comes from the case in which
the processors are human decision makers. Then, it is very likely that their models
will have some differences, even if they are "experts" on the problem facing them.

It is important to realize that there can be no unique normative answer on what
the processors should do in the face of model differences. Rather, we may assume
some general rules of behavior and then proceed to study the consequences of such
assumptions. We sketch here a few possible lines of approach, leaving more detailed
analysis for future research.


Note  that  two  processors  with  different  models  will  make,  in  general,  different 
decisions,  even  if  they  have  the  same  information.  This  is  the  same  as  what  would 
happen  if  they  had  identical  models  but  different  cost  functions.  (In  fact,  the 
distinction  between  model  and  cost  function  cannot  be  made  perfectly  sharp) .  So, 
differences  in  models  may  lead  to  situations  best  modelled  by  game  theory.  Moreover, 
since  the  processors  have  to  reach  consensus,  we  would  effectively  obtain  a  cooperative 
(bargaining)  game,  consensus  being  reached  somewhere  on  an  appropriate  negotiation 
set.  This  game  theoretic  approach  can  have  some  drawbacks,  mainly  because  game 
theory  requires  fairly  strong  "rationality"  assumptions.  For  example,  it  might  be 
necessary  to  assume  that 

a)  The  models  of  each  processor  are  common  knowledge,  or 

b) The models of each processor are drawn randomly from a set of possible models
and  the  probability  distribution  associated  with  this  random  selection  is  common 
knowledge.  (This  latter  assumption  is  similar  to  those  introduced  by  Harsanyi  [1967; 
1968a;  1968b]  to  address  non-cooperative  games  with  different  models.  However,  very 
little  can  be  said,  in  general,  under  such  an  assumption,  mainly  because  a  probability 
distribution  over  a  set  of  models  may  be  hard  to  handle) . 

An  additional  objection  to  game  theoretic  approaches  can  be  the  following:  If 
two  decision  makers  -  with  different  models  -  want  to  cooperate  and  reach  consensus, 
they  should  not  bargain  on  an  appropriate  decision,  but  they  should  first  try  to 
reconcile  their  models. 

A  different  situation  is  obtained  if  modelling  differences  exist  but  no  processor 
knows  that  this  is  the  case.  So,  suppose  that  two  processors  employ  the  scheme  of 


the  preceding  sections.  Each  one  has  a  different  model,  but  each  one  believes 
that  they  have  the  same  model.  What  happens  then  is  that  each  processor  interprets 
incorrectly  the  messages  it  receives.  Let  us  elaborate  on  the  above  statement. 

Suppose, for example, that processor 1 computes a decision $u^1$ based on its own model. Its decision is given by $u^1=f^1(\omega)$, for some function $f^1: \Omega \to U$. Clearly, $f^1$ depends on the cost function being minimized as well as the prior probabilities (model) assumed by processor 1. Processor 2, however, believes that processor 1 uses a different set of prior probabilities in its calculations. So, processor 2 believes that $u^1=f^2(\omega)$, for some other function $f^2: \Omega \to U$. When processor 2 receives a message with the value of $u^1$, it deduces that the event $(f^2)^{-1}(u^1)$ has occurred, whereas in fact the event $(f^1)^{-1}(u^1)$ has occurred. (Here, the superscript $-1$ stands for the inverse image of a function.) In other words, messages code information; by not knowing the coding rule used by processor 1, processor 2 decodes messages incorrectly.
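The decoding argument can be made concrete on a small finite sample space. The sketch below is purely illustrative: the sample space `omega`, the two decision rules `f1` and `f2`, and the helper `preimage` are hypothetical choices of ours, not taken from the analysis above.

```python
def preimage(f, value, omega):
    """Inverse image f^{-1}(value): all outcomes in omega that f maps to value."""
    return {w for w in omega if f(w) == value}


omega = {1, 2, 3, 4, 5, 6}


def f1(w):
    """Rule actually used by processor 1 (hypothetical)."""
    if w <= 2:
        return 0
    return 1 if w <= 4 else 2


def f2(w):
    """Rule that processor 2 believes processor 1 is using (hypothetical)."""
    return 0 if w <= 3 else 1


true_outcome = 3
message = f1(true_outcome)                 # processor 1 sends u^1 = 1

actual_event = preimage(f1, message, omega)    # event that really occurred
decoded_event = preimage(f2, message, omega)   # event deduced by processor 2

print(actual_event, decoded_event)
print(preimage(f2, 2, omega))  # a message that is impossible under f2
```

In this instance the decoded event does not even contain the true outcome, and the message with value 2 has an empty inverse image under `f2`.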

Suppose, however, that $(f^2)^{-1}(u^1)=\emptyset$. This means, in the eyes of processor 2:
"if processor 1 had the same model as me, it would not send me such a message;
therefore, it has a different model". Once the processors (or a subset of them)
realize  that  they  do  not  share  a  common  model,  we  are  back  to  the  game-like  situation 
discussed  earlier  in  this  section.  Concerning  awareness  of  modelling  differences, 
there  are  three  possibilities: 

a)  Some  processors  find  out  about  modelling  differences  after  the  scheme  has 
operated  for  finite  time  only. 

b) No one finds out in finite time, but they may find out asymptotically. (This
can be made more precise.)


c)  Modelling  differences  are  not  detected,  not  even  asymptotically . 

We  may  then  pose  questions  of  the  following  type: 

(i) What are some conditions for cases (a), (b) or (c) above to occur?

(ii)  For  cases  (a)  and  (b)  what  can  be  said  about  disagreement  at  the  time  that 
modelling  differences  are  detected? 

(iii)  For  case  (c) ,  the  processors  must  asymptotically  agree.  What  is  it  they  agree 
upon? 

(iv) Suppose that in addition to running the scheme of this chapter, each processor
performs a (sequential) hypothesis test, to test for modelling differences. When
will modelling differences be detected?

We have no results on the above questions, which are topics for future research.
We can draw the conclusion, however, that modelling differences may lead to a variety
of qualitatively different situations. Exploring such situations may lead to some
understanding of decision making in the face of modelling differences, which is an
area of great practical importance.

Remark :  A  recent  paper  by  Teneketzis  and  Varaiya  [1984]  shows  that  in  the  case  of 
an  estimation  problem  on  a  finite  probability  space  and  under  perfect  memory,  case 
(b)  cannot  occur  and  if  case  (c)  occurs  then  the  processors  agree  in  finite  time. 

In  fact,  it  is  easy  to  show  (using  Theorem  4.3.3)  that  this  result  remains  valid 
under  imperfect  memory  and  for  an  arbitrary  cost  function. 


4.7  CONCLUSIONS 


A  set  of  processors  with  the  same  objective  who  start  communicating  to  each 
other  their  tentative  optimal  decisions  are  guaranteed  to  agree  in  the  limit.  Under 
certain  assumptions,  this  is  true  even  if  the  messages  are  received  in  the  presence 
of  noise  and  even  if  the  memory  of  the  processors  is  limited  and  they  are  allowed  to 
forget  some  of  their  past  knowledge.  Moreover,  they  are  guaranteed  to  converge  to 
the  centralized  optimal  decision  for  linear  estimation  problems,  provided  that  no 
processor forgets its own raw observations. This corresponds to a geometrically
convergent decomposition algorithm for static linear estimation problems, on which some
numerical results are reported.

Similar  results  are  obtained  if  the  processors  do  not  communicate  directly  but 
receive  messages  from  a  coordinator  who  evaluates  a  weighted  average  of  all  tentative 
decisions.  In  the  latter  framework,  for  linear  estimation  problems  and  with  perfect 
memory,  optimal  tentative  decisions  correspond  to  Nash  strategies  for  a  certain 
sequential  game  and  admit  an  economic  interpretation. 

These  results  are  valid  when  all  processors  share  the  same  model  (identical  prior 
probabilities) .  The  characterization  of  the  behavior  of  processors  with  different 
models  is  an  unexplored  problem. 

Remark:  In  a  recent  paper,  Washburn  and  Teneketzis  [1984]  have  studied  similar 
schemes  from  a  more  abstract  point  of  view.  Their  approach  is  limited,  however,  to 
the  special  case  in  which  all  processors  have  perfect  memory. 


CHAPTER  5:  DECENTRALIZED  DETERMINISTIC 
AND  STOCHASTIC  ITERATIVE  ALGORITHMS 

5.1  INTRODUCTION  AND  OVERVIEW 

From  a  conceptual  point  of  view,  this  Chapter  may  be  viewed  as  a  continuation 
of  Chapter  4.  We  consider  again  decentralized  schemes  in  which  a  set  of 
processors make (tentative) decisions, perform computations or obtain observations
and exchange relevant messages, with the end-goal of minimizing a certain
cost  function.  In  the  schemes  of  Chapter  4,  each  processor  was  assumed  to  be  as 
"rational"  as  possible:  everything  was  imbedded  in  a  Bayesian  framework  and 
processors  were  making  tentative  decisions  which  were  optimal,  given  their 
information.  The  main  practical  difficulty  with  such  schemes  is  that,  at  each 
stage,  each  processor  may  face  an  optimization  problem  which  is  difficult  by 
itself;  the  only  exception  is  the  LQG  problem  of  Section  4.4.  For  this  reason, 
it  may  be  more  useful  to  assume  that  processors  do  not  update  in  an  "optimal" 
(Bayesian)  way,  but  rather  that  they  update  by  moving  a  little  bit  in  a  direction 
of  improvement.  In  other  words,  we  are  interested  in  gradient-like  (or  descent) 
iterative  schemes.  Although  a  descent  assumption  could  be  considered  to  be  an 
arbitrary  restriction  it  should  be  stressed  that  such  an  assumption  underlies 
most  centralized  iterative  schemes  for  deterministic  optimization,  recursive 
identification,  stochastic  approximation,  random  search  algorithms,  adaptation 
and training algorithms [Poljak and Tsypkin, 1973; Ljung, 1977a]. Given the
multitude  of  application  areas  for  centralized  descent  algorithms,  it  should  not 
be  surprising  that  their  decentralized  counterparts  may  be  useful  in  a  variety  of 
contexts  as  well.  In  fact,  we  discuss  in  later  sections  applications  in  parallel 
computing systems, distributed signal processing (identification), decision making
in  large  organizations  and  data  communication  networks. 


Let  us  now  elaborate  on  what  is  a  genuine  "decentralized  algorithm." 

Given  any  centralized  algorithm,  it  is  often  straightforward  to  design  a 
decentralized  (parallel)  implementation,  by  employing  a  set  of  perfectly 
synchronized  processors.  Such  a  decentralized  algorithm  is  mathematically 
identical  to  the  original  centralized  one  and  no  new  analysis  is  required. 
Synchronous algorithms may have, however, certain drawbacks: a) Synchronism
may be hard to enforce, or its enforcement may introduce substantial overhead.
b) Communication delays may introduce bottlenecks to the speed of the algorithm
(the time required for one stage of the algorithm will be constrained by the
slowest communication channel). c) Synchronous algorithms may require far more
communications than are actually necessary. d) Even if all processors are
equally powerful, some will perform certain computations faster than others, due
solely to the fact that they operate on different inputs. This, in turn, may lead
to having many processors idle for a large proportion of time. For these reasons,
we choose to study asynchronous decentralized iterative optimization algorithms
in which each processor does not need to communicate to each other processor at
each time instance; also, processors may keep performing computations (or, in a
decision making context, update their decisions) without having to wait until
they receive the messages that have been transmitted to them; processors are
allowed to remain idle some of the time; finally, some processors may perform
computations faster than others. Such schemes can alleviate communication overloads
and they are not excessively slowed down either by communication delays or
by differences in the time it takes processors to perform one computation.

(A  similar  discussion  of  the  merits  of  asynchronous  algorithms  is  provided  by 
H.T.  Kung,  [1976].) 

We  now  outline  the  contents  of  this  Chapter.  In  Section  5.2,  we  present 
a  model  for  decentralized  computation  which  is  a  general  description  of  the 
structure of most algorithms to be studied in this Chapter. This model allows for
communications between processors that are infrequent and fairly chaotic, as well
as for communication delays. An interesting and useful feature of this model is
that a global (aggregate) state of computation may be associated (in a
nontrivial way) with a decentralized algorithm.

In  Section  5.3  we  prove  convergence  of  decentralized  gradient-like 
stochastic  algorithms.  We  consider  constant  step-size  algorithms  (deterministic 
gradient methods being a special case), as well as decreasing step-size algorithms.
For the latter case, we show that convergence is obtained even if the time between
consecutive  communications  goes  to  infinity,  as  the  algorithm  progresses. 

In  Section  5.4  we  consider  stochastic  algorithms  with  correlated  noise  and 
prove  convergence  under  assumptions  similar  to  those  that  are  employed  to  show 
convergence  of  centralized  algorithms  via  the  ODE  approach. 

In  Section  5.5  we  present  some  applications  of  our  results  in  decentralized 
system  identification. 

In Section 5.6 we consider a decentralized (deterministic) gradient algorithm.
This being a particularly simple algorithm, we are able to study in more
detail the effect of the various parameters describing the structure of the
communication process. In Section 5.7 we indicate that these results may
form the basis of a new approach to organizational design problems. Then, in
Section 5.8 we discuss a few more potential applications of our results.

In Section 5.9 we outline some topics for further research; finally,
Section 5.10 contains a summary and some general conclusions.

A  simpler  version  of  Sections  5.2,  5.3  appears  in  [Tsitsiklis,  Bertsekas 
and  Athans,  1984].  Also,  a  major  part  of  this  Chapter  is  discussed  in 
[Bertsekas,  Tsitsiklis  and  Athans,  1984]. 

5.2  A  MODEL  OF  DECENTRALIZED  COMPUTATION 

We  present  here  the  model  of  decentralized  computation  employed  in  this 
chapter.  We  also  define  the  notation  and  conventions  to  be  followed.  Related 
models  of  decentralized  computation  have  been  used  by  Chazan  and  Miranker  [1969] , 
Baudet  [1978]  and  Bertsekas  [1982,1983]  where  each  processor  specialized  in 
updating  a  different  component  of  some  vector.  The  model  developed  here  is  more 
general,  in  that  it  allows  different  processors  to  update  the  same  component  of 
some  vector.  If  their  individual  updates  are  different,  their  disagreement  is 
(asymptotically)  eliminated  through  a  process  of  communicating  and  combining  their 
individual  updates.  In  such  a  case,  we  will  say  that  there  is  overlap  between 
processors.  Overlapping  processors  are  probably  not  very  useful  in  the  context 
of  deterministic  algorithms,  unless  redundancy  is  intended  to  provide  a  certain 
degree  of  fault  tolerance.  For  stochastic  algorithms,  however,  overlap  essentially 
corresponds  to  having  different  processors  obtain  noisy  measurements  of  the  same 
unknown quantity and effectively increases the "signal-to-noise ratio."
Let $H_1, H_2, \ldots, H_L$ be Banach spaces† and let $H = H_1 \times H_2 \times \cdots \times H_L$. If $x = (x_1, \ldots, x_L)$, $x_\ell \in H_\ell$, we will refer to $x_\ell$ as the $\ell$-th component of x. We endow H with the norm

$$\|(x_1, x_2, \ldots, x_L)\| = \max_{\ell} \|x_\ell\|. \qquad (5.2.1)$$

Let $\{1,\ldots,M\}$ be the set of processors that participate in the distributed
computation. As a general rule concerning notation, we use subscripts
to indicate a component of an element of H, superscripts to indicate an associated
processor; we indicate time by an argument that follows.

The algorithms to be considered evolve in discrete time. Even if a distributed
algorithm is asynchronous and communication delays are real (i.e., not
integer) variables, the events of interest (an update by some processor, transmission
or reception of a message) may be indexed by a discrete variable; so,
the restriction to discrete time entails no loss of generality.

It  is  important  here  to  draw  a  distinction  between  "global"  and  "local" 
time.  The  time  variable  we  have  just  referred  to  corresponds  to  a  global  clock. 
Such  a  global  clock  is  needed  only  for  analysis  purposes:  it  is  the  clock  of  an 
analyst  who  watches  the  operation  of  the  system.  On  the  other  hand  the  processors 
may  be  working  without  having  access  to  a  global  clock.  They  may  have  access  to 
a  local  clock  or  to  no  clock  at  all.  We  will  see  later  that  our  results,  based 
on  the  existence  of  a  global  clock,  may  be  used  in  a  straightforward  way  to  prove 
convergence  of  algorithms  implemented  on  the  basis  of  local  clocks  only. 

† In most situations of interest, each $H_\ell$ will turn out to be finite dimensional. However, the generalization to Banach spaces does not introduce any new difficulties in our analysis.

We assume that each processor has a buffer in its memory in which it keeps
some element of H. The value stored by the i-th processor at time n is denoted
by $x^i(n)$. At time n, each processor may receive some exogenous measurements
and/or perform some computations. This allows it to compute a "step" $s^i(n) \in H$,
to be used in evaluating the new vector $x^i(n+1)$. Besides their own measurements
and computations, processors may also receive messages from other processors
which will be taken into account in evaluating their next vector. The process
of communications is assumed to be as follows:

At any time n, processor i may transmit some (possibly all) of the
components of $x^i(n)$ to some (possibly all or none) of the other processors.
(In a physical implementation, messages do not need to go directly from their
origin to their destination; they may go through some intermediate nodes. Of
course, this does not change the mathematical model presented here.) We allow,
for the time being, arbitrary communication delays. For convenience, we also
assume that for any pair (i,j) of processors, for any component $x_\ell$ and any time
n, processor i may receive at most one message originating from processor j and
containing an element of $H_\ell$. This leads to no significant loss of generality:
for example, a processor that receives two messages simultaneously could keep
only the one which was most recently sent; if messages do not carry timestamps,
there could be some other arbitration mechanism. Physically, of course, simultaneous
receptions are impossible; so, a processor may always identify and keep
the most recently received message, even if all messages arrived at the same
discrete time n.

If a message from processor j, containing an element of $H_\ell$, is received by
processor i ($i \ne j$) at time n, let $t_\ell^{ij}(n)$ denote the time that this message was
sent. Therefore, the content of such a message is precisely $x_\ell^j(t_\ell^{ij}(n))$.
Naturally, we assume that $t_\ell^{ij}(n) \le n$. For notational convenience, we also let
$t_\ell^{ii}(n)=n$, for all i, $\ell$, n. We will be assuming that the algorithm starts at
time 1; accordingly, we assume that $t_\ell^{ij}(n) \ge 1$. Finally, we denote by $T_\ell^{ij}$
the set of all times that processor i receives a message from processor j,
containing an element of $H_\ell$. To simplify matters we will assume that, for any
i, j, $\ell$, the set $T_\ell^{ij}$ is either empty or infinite.

Once processor i has received the messages arriving at time n and has also
evaluated $s^i(n)$, it evaluates its new vector $x^i(n+1) \in H$ by forming (componentwise)
a convex combination of its old vector and the values in the messages it has
just received, as follows:

$$x_\ell^i(n+1) = \sum_{j=1}^{M} a_\ell^{ij}(n)\, x_\ell^j(t_\ell^{ij}(n)) + \gamma^i(n)\, s_\ell^i(n), \quad n \ge 1, \qquad (5.2.2)$$

where $s_\ell^i(n)$ is the $\ell$-th component of $s^i(n)$ and the coefficients $a_\ell^{ij}(n)$ are
scalars satisfying:

(i) $a_\ell^{ij}(n) \ge 0$, $\forall i,j,\ell,n$, (5.2.3)

(ii) $\sum_{j=1}^{M} a_\ell^{ij}(n) = 1$, $\forall i,\ell,n$, (5.2.4)

(iii) $a_\ell^{ij}(n)=0$, $\forall n \notin T_\ell^{ij}$, $i \ne j$. (5.2.5)
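A single application of (5.2.2) for one component can be sketched as follows. This is a minimal illustration under assumptions of ours: the component space $H_\ell$ is taken to be the real line, a missing message is represented by `None`, and the function name `combine_and_step` is hypothetical.

```python
def combine_and_step(own_index, values, coeffs, gamma, step):
    """One update of (5.2.2) for a single scalar component of processor own_index.

    values[j] is the value x_l^j(t_l^{ij}(n)) received from processor j at time n,
    or None if no message from j arrived; values[own_index] is the processor's own
    current value. coeffs[j] is the combining coefficient a_l^{ij}(n).
    """
    if any(a < 0 for a in coeffs):
        raise ValueError("coefficients must be nonnegative, see (5.2.3)")
    if abs(sum(coeffs) - 1.0) > 1e-12:
        raise ValueError("coefficients must sum to one, see (5.2.4)")
    if values[own_index] is None:
        raise ValueError("a processor always knows its own current value")

    combined = 0.0
    for a, value in zip(coeffs, values):
        if value is None:
            # No message received from this processor: its weight must vanish.
            if a != 0:
                raise ValueError("nonzero weight without a message, see (5.2.5)")
            continue
        combined += a * value
    return combined + gamma * step


# Processor 0 of three: own value 1.0, a message with value 2.0 from processor 1,
# nothing from processor 2; step size 0.1 and step -1.0 (all made-up numbers).
new_value = combine_and_step(0, [1.0, 2.0, None], [0.5, 0.5, 0.0], 0.1, -1.0)
print(new_value)
```

Rejecting a nonzero weight on a missing message is a design choice of the sketch; it simply enforces (5.2.5) at run time.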

Remarks:

1. Note that $t_\ell^{ij}(n)$ has been defined only for those times n that processor i
receives a message of a particular type, i.e. for $n \in T_\ell^{ij}$. However, whenever
$t_\ell^{ij}(n)$ is undefined, we have assumed above that $a_\ell^{ij}(n)=0$, so that
equation (5.2.2) has an unambiguous meaning.

2. When we refer to a processor performing a "computation", we mean the
evaluation and addition of the term $\gamma^i(n) s_\ell^i(n)$ in (5.2.2). With this
terminology, forming the convex combination in (5.2.2) is not called
a computation. We denote by $T_\ell^i$ the set of all times that processor
i performs a computation involving the $\ell$-th component. Whenever $n \notin T_\ell^i$,
it is understood that $s_\ell^i(n)$ in (5.2.2) equals zero. We assume again
that for any i, $\ell$ the set $T_\ell^i$ is either infinite or empty. Accordingly,
processor i will be called computing, or non-computing, for component $\ell$.

3. The quantities $\gamma^i(n)$ in (5.2.2) are nonnegative scalar step-sizes. These
step-sizes may be constant (e.g. $\gamma^i(n)=\gamma$, $\forall n$), or time-varying, e.g.
$\gamma^i(n)=1/t_n^i$, where $t_n^i$ is the number of times that processor i has performed
a computation up to time n. Notice that with such a choice each processor
may evaluate its step-size using only a local counter rather than a global
clock.
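The time-varying choice of step-size requires nothing more than a per-processor counter. The following sketch makes one assumption that the text leaves open: the counter includes the computation currently being performed, so that the first step-size is 1. The class name `LocalStepSize` is ours.

```python
class LocalStepSize:
    """Step-size 1/t, where t counts the computations performed by one processor."""

    def __init__(self):
        self.count = 0  # local counter; no access to a global clock is needed

    def next(self):
        """Register one more computation and return the step-size to use for it."""
        self.count += 1
        return 1.0 / self.count


counter = LocalStepSize()
first_three = [counter.next() for _ in range(3)]
print(first_three)
```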

4. The envisaged sequence of events underlying (5.2.2) at any time n is as
follows: processor i

(i) Transmits $x_\ell^i(n)$ to other processors.

(ii) Receives messages $x_\ell^j(t_\ell^{ij}(n))$ from other processors.

(iii) Computes $s_\ell^i(n)$.

(iv) Evaluates $x_\ell^i(n+1)$.


5. Note that the combining of the vectors of different processors, in
equation (5.2.2), is done componentwise. Consequently, we can argue
inductively that $x_\ell^i(n+1)$ is only a function of $\{s_\ell^j(k):\ j \in \{1,\ldots,M\},\ k \in \{1,\ldots,n\}\}$ and $\{x_\ell^j(1):\ j \in \{1,\ldots,M\}\}$.

Examples 

We now introduce a collection of simple examples representing various classes
of algorithms we are interested in, so as to illustrate the nature of the
assumptions to be introduced later. We actually start with a broad classification
and then proceed to more special cases. In these examples, we model the message
receptions and transmissions (i.e. the sets $T^{ij}_\ell$ and the variables $t^{ij}_\ell(n)$),
the times at which computations are performed (i.e. the sets $T^i_\ell$) and the
combining coefficients $a^{ij}_\ell(n)$ as deterministic. (This does not mean, however,
that they have to be a priori known by the processors.) We will see in Section 5.3
that this assumption may be relaxed.


Specialization: This is the case considered by Bertsekas [1982, 1983], where
each processor updates a particular component of the x vector specifically
assigned to it and relies on messages from the other processors for the remaining
components. Formally:

(i) M=L. (There are as many processors as there are components.)

(ii) $s^i_\ell(n)=0$, $\forall \ell\neq i$, $\forall n$. (A processor may update only its own component;
$T^i_\ell=\emptyset$, $\forall i\neq\ell$.)

(iii) Processor j only sends messages containing elements of $H_j$; if processor i
receives such a message, it uses it to update $x^i_j$ by setting $x^i_j$ equal to
the value received. Equivalently,

b) If processor i receives a message from processor j at time n,
i.e. if $n\in T^{ij}_j$, then $a^{ij}_j(n)=1$. Otherwise, $a^{ij}_j(n)=0$, and $a^{ii}_j(n)=1$.

Overlap: This is the other extreme, at which L=1 (we do not distinguish
components of elements of H), messages contain elements of H (not just components)
and each processor may update any component of x. (For this case subscripts are
redundant and will be omitted.)

We now let H be finite-dimensional and assume that $J: H\to[0,\infty)$ is a
continuously differentiable nonnegative function with a Lipschitz continuous
derivative.


Example I: Deterministic Gradient Algorithm; Specialization. Let
$\gamma^i(n)=\gamma_0>0$, $\forall n,i$. At each time $n\in T^i_i$ that processor i updates $x^i_i$, it computes
$s^i_i(n)=-\frac{\partial J}{\partial x_i}(x^i(n))$ and lets $s^i_j(n)=0$, for $j\neq i$. We assume that each processor i
communicates its component $x^i_i$ to every other processor at least once every
$B_1$ time units, for some constant $B_1$. Other than this restriction, we allow the
transmission and reception times to be arbitrary. (A related stochastic algorithm
could be obtained by letting $s^i_i(n)=-\frac{\partial J}{\partial x_i}(x^i(n))(1+w^i(n))$, where $w^i(n)$ is unit
variance white noise, independent for different i's.)
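A minimal numerical sketch of the specialization scheme of Example I follows. Everything specific in it is an assumption made for illustration: the two-dimensional quadratic cost, the step-size 0.1, the communication period of 5 steps, zero communication delay, and the function names `grad` and `run_specialized_gradient`.

```python
def grad(x):
    """Gradient of the illustrative cost
    J(x) = 0.5*(x0 - x1)**2 + 0.5*(x0 - 1)**2 + 0.5*(x1 + 1)**2."""
    return [2 * x[0] - x[1] - 1.0, 2 * x[1] - x[0] + 1.0]


def run_specialized_gradient(steps=3000, gamma=0.1, period=5):
    # x[i] is processor i's local copy of the whole vector (x_0, x_1).
    x = [[5.0, -3.0], [-4.0, 2.0]]
    for n in range(steps):
        new_x = [row[:] for row in x]
        for i in (0, 1):
            # Processor i updates only its own component i, using its
            # (possibly outdated) local copy of the other component.
            new_x[i][i] = x[i][i] - gamma * grad(x[i])[i]
        if n % period == 0:
            # Every `period` steps each processor sends its own component;
            # the receiver overwrites its copy (zero delay assumed here).
            new_x[0][1] = x[1][1]
            new_x[1][0] = x[0][0]
        x = new_x
    return x


final = run_specialized_gradient()
```

For this particular cost the minimizer is (1/3, -1/3), and each processor's own component approaches it even though the copy of the other component is refreshed only every few steps.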


Example II: Newton's Method; Overlap. For simplicity we assume that there
are only two processors (M=2). Let $\gamma^i(n)=\gamma_0>0$, $\forall n$. We also assume that J is
twice continuously differentiable, strictly convex and its Hessian matrix, denoted
by G(x), satisfies $0<\delta_1 I\le G(x)\le\delta_2 I$, $\forall x\in H$. At each time $n\in T^i$, processor i
computes $s^i(n)=-G^{-1}(x^i(n))\frac{\partial J}{\partial x}(x^i(n))$. If at time n processor 1 (respectively,
2) receives a message $x^2(t^{12}(n))$ (resp. $x^1(t^{21}(n))$), it updates its state vector
by $x^1(n+1)=a^{11}x^1(n)+a^{12}x^2(t^{12}(n))$, (resp. $x^2(n+1)=a^{21}x^1(t^{21}(n))+a^{22}x^2(n)$).
Here we assume that $0<a^{ij}<1$ and that $a^{11}+a^{12}=a^{21}+a^{22}=1$. We make the same
assumptions on transmission and reception times as in Example I.

Example III: Distributed Stochastic Approximation; Specialization. Let $\gamma^i(n)$
be such that, for some positive constants $A_1$, $A_2$, $A_1/n\le\gamma^i(n)\le A_2/n$, $\forall n$.
Notice that the implementation of such a stepsize only requires a local clock that
runs in the same time scale (i.e. within a constant factor) as the global clock.
For $n\in T^i_i$, let $s^i_i(n)=-\frac{\partial J}{\partial x_i}(x^i(n))+w^i_i(n)$. Also, $s^i_j(n)=0$, for $i\neq j$ and for
all n. We assume that $w^i_i(n)$, conditioned on the past history of the algorithm,
has zero mean and that $E[\|w^i_i(n)\|^2\,|\,x^i(n)]\le K(J(x^i(n))+1)$, for some constant K.
We assume that for some $B_1>0$, $\beta\ge1$ and for all n, each processor communicates its
component $x^i_i$ to every other processor at least once during the time interval
$[B_1n^\beta, B_1(n+1)^\beta]$. Other than the above restriction, we allow transmission and
reception times to be arbitrary. Notice that the above assumptions allow the time
between consecutive communications to grow without bound.

Example IV: Distributed Stochastic Approximation; Overlap. Let $\gamma^i(n)$ be as
in Example III and let M=2. For $n\in T^i$, let $s^i(n)=-\frac{\partial J}{\partial x}(x^i(n))+w^i(n)$, where
$w^i(n)$ is as in Example III. We make the same assumptions on transmission and
reception times as in Example III. Whenever a message is received, a processor
combines its vector with the content of that message using the combining rules
of Example II.
Example V: This example is rather academic but will serve to illustrate
some of the ideas to be introduced later. Consider the case of overlap, assume
that H is one-dimensional, and let $\gamma^i(n)=1$, $\forall n$. Assume that, at each time n,
either all processors communicate to each other, or no processor sends any
message. Let the communication delays be zero (so, $t^{ij}(n)=n$, whenever $t^{ij}(n)$
is defined) and assume that $a^{ij}(n)=a^{ij}$ (constant) at those times n that
messages are exchanged. We define vectors $x(n)=(x^1(n),\dots,x^M(n))$ and
$s(n)=(s^1(n),\dots,s^M(n))$. Then, the algorithm (5.2.2) may be written as

$$x(n+1)=A(n)x(n)+s(n). \tag{5.2.6}$$

For each time n, either A(n)=I (no communications) or A(n)=A, the matrix
consisting of the coefficients $a^{ij}$. The latter is a "stochastic" matrix: it has
nonnegative entries and each row sums to 1. We assume that each $a^{ij}$ is positive.
It follows that $\bar{A}=\lim_{n\to\infty}A^n$ exists and has identical rows with positive elements.
We assume that the time between consecutive communications is bounded but
otherwise arbitrary. Clearly then, $\lim_{n\to\infty}\prod_{m=k}^{n}A(m)=\bar{A}$, for all k. This example
corresponds to a set of processors who individually solve the same problem and,
from time to time, simultaneously exchange their partial results. It is
interesting to compare equation (5.2.6) with the generic equation

$$x(n+1)=x(n)+\gamma(n)s(n)$$

which arises in centralized algorithms.
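The convergence of the products of the matrices A(m) in Example V can be checked numerically. The sketch below is illustrative only: the particular 3-by-3 matrix, the rule that communication occurs every third step, and the helper name `matmul` are all assumptions.

```python
def matmul(P, Q):
    """Product of two square matrices given as lists of rows."""
    n = len(P)
    return [[sum(P[i][k] * Q[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


A = [[0.5, 0.3, 0.2],
     [0.1, 0.6, 0.3],
     [0.2, 0.2, 0.6]]          # positive entries, each row sums to 1
I3 = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

# A(m) = A at communication times (every third step here), else identity.
prod = I3
for m in range(1, 301):
    A_m = A if m % 3 == 0 else I3
    prod = matmul(A_m, prod)   # left-multiply, giving A(m) ... A(1)

# Largest difference between corresponding entries of any two rows.
row_spread = max(abs(prod[i][j] - prod[0][j])
                 for i in range(3) for j in range(3))
```

After enough communication steps the rows of the product coincide up to rounding, so multiplying any vector x by the product yields (nearly) the same value in every coordinate.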

Assumptions  on  the  communications  and  the  combining  coefficients 

We now consider a set of assumptions on the nature of the communications and
combining process, so that the preceding examples appear as special cases. We
start with a relatively simple set of assumptions which are very easy to enforce,
examine their consequences and finally suggest a more general version.

For each component $\ell\in\{1,\dots,L\}$ we introduce a directed graph $G_\ell=(V,E_\ell)$ with
nodes $V=\{1,\dots,M\}$ corresponding to the set of processors. An edge (j,i) belongs
to $E_\ell$ if and only if $T^{ij}_\ell$ is infinite, that is, iff processor j sends an infinite
number of messages to processor i with a value of the $\ell$-th component $x^j_\ell$.

Assumption 5.2.1: For each component $\ell\in\{1,\dots,L\}$, the following hold:

a) There is at least one computing processor for component $\ell$.

b) There is a directed path in $G_\ell$, from every computing processor (for component $\ell$)
to every other processor (computing or not).

c) There is some $\alpha>0$ such that:

(i) If processor i receives a message from processor j at time n
(i.e. if $n\in T^{ij}_\ell$), then $a^{ij}_\ell(n)\ge\alpha$.

(ii) For every computing processor i, $a^{ii}_\ell(n)\ge\alpha$, $\forall n$.

(iii) If processor i has in-degree (in $G_\ell$) larger than or equal to 2, then
$a^{ii}_\ell(n)\ge\alpha$, $\forall n$.

Let us pause to indicate the intuitive content of part (c) of Assumption 5.2.1.
Part (i) states that a processor should not ignore the messages it receives.
Part (ii) requires the past updates of any computing processor to have a lasting
effect on its state of computation. Finally, part (iii) implies that if processor i
receives messages from two processors (say $i_1$, $i_2$), it does not forget the effects
of messages of processor $i_1$ upon reception of a message from processor $i_2$. These
conditions, together with part (b) of the Assumption, guarantee that any update
by any computing processor has a lasting effect on the states of computation
of all other processors.
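Parts (a) and (b) of Assumption 5.2.1 are purely graph-theoretic and can be tested mechanically for a given component. The sketch below assumes processors are numbered from 0 and that the edge set of the graph is supplied as a list of pairs (j, i); the function names `reachable_from` and `satisfies_parts_a_b` are hypothetical.

```python
from collections import deque


def reachable_from(source, edges, num_nodes):
    """Set of nodes reachable from `source` along directed edges (j, i)."""
    succ = {v: [] for v in range(num_nodes)}
    for j, i in edges:
        succ[j].append(i)
    seen, queue = {source}, deque([source])
    while queue:
        v = queue.popleft()
        for w in succ[v]:
            if w not in seen:
                seen.add(w)
                queue.append(w)
    return seen


def satisfies_parts_a_b(computing, edges, num_nodes):
    """Parts (a) and (b): some processor computes, and every computing
    processor reaches every other processor."""
    if not computing:
        return False
    everyone = set(range(num_nodes))
    return all(reachable_from(c, edges, num_nodes) == everyone
               for c in computing)


# Processor 0 computes and feeds a chain 0 -> 1 -> 2.
ok = satisfies_parts_a_b({0}, [(0, 1), (1, 2)], 3)
# Processor 2 also computes, but nobody ever hears from it.
bad = satisfies_parts_a_b({0, 2}, [(0, 1), (1, 2)], 3)
```

In the second case the updates of processor 2 could never influence processors 0 and 1, which is exactly the situation that part (b) excludes.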

Assumption 5.2.2: The time between consecutive transmissions of component $x^j_\ell$
from processor j to processor i is bounded by some $B_1>0$, for all $(j,i)\in E_\ell$.

Assumption 5.2.3: There are constants $B_1>0$, $\beta\ge1$ such that, for any $(j,i)\in E_\ell$,
and for any n, at least one message $x^j_\ell$ is sent from processor j to processor i
during the time interval $[B_1n^\beta, B_1(n+1)^\beta]$. Moreover, the total number of
messages transmitted and/or received during any such interval is bounded.

Assumption 5.2.4: Communication delays are bounded by some $B_0>0$.

Note that Assumption 5.2.2 is a special case of 5.2.3, with $\beta=1$. Assumption
5.2.1 holds for all the examples introduced above. Assumption 5.2.2 holds for
Examples I, II, V; Assumption 5.2.3 holds for Examples III, IV, except for its last
part which has to be explicitly introduced. Assumption 5.2.4 also needs to be
explicitly introduced.

We now investigate the consequences of Assumptions 5.2.1-5.2.4. Note that
equation (5.2.2) which defines the algorithm is a linear system. In particular,
it is easy to see that, for any $i,n,\ell$, the variable $x^i_\ell(n+1)$ is a linear combination
of the initial conditions $\{x^j_\ell(1): j=1,\dots,M\}$ and the "inputs" $\{\gamma^j(k)s^j_\ell(k):
k=1,\dots,n;\ j=1,\dots,M\}$. Therefore, there exist scalars $\Phi^{ij}_\ell(n|k)$, for $n>k$, such that

$$x^i_\ell(n)=\sum_{j=1}^{M}\Phi^{ij}_\ell(n|0)x^j_\ell(1)+\sum_{k=1}^{n-1}\sum_{j=1}^{M}\gamma^j(k)\Phi^{ij}_\ell(n|k)s^j_\ell(k). \tag{5.2.7}$$
The coefficients $\Phi^{ij}_\ell(n|k)$ are determined by the sequence of transmission and
reception times and the combining coefficients. Consequently, they are unknown,
in general. Nevertheless, they have the following qualitative properties.


(5.2.12) 


The proof of Lemma 5.2.1 is given in Appendix A, so as not to disrupt
continuity. Let us rather try to interpret this result. Consider a scenario
whereby at time n all processors cease computing; that is, $s^i(m)=0$, $\forall m\ge n$.
Suppose, however, that they keep communicating and combining. For this scenario,
equation (5.2.7) yields

$$x^i_\ell(m)=\sum_{j=1}^{M}\Phi^{ij}_\ell(m|0)x^j_\ell(1)+\sum_{k=1}^{n-1}\sum_{j=1}^{M}\gamma^j(k)\Phi^{ij}_\ell(m|k)s^j_\ell(k). \tag{5.2.13}$$

Taking the limit, as $m\to\infty$, and using part (ii) of Lemma 5.2.1, we conclude that
the limit exists, is independent of i and equals $y_\ell(n)$ defined by

$$y_\ell(n)=\sum_{j=1}^{M}\Phi^{j}_\ell(0)x^j_\ell(1)+\sum_{k=1}^{n-1}\sum_{j=1}^{M}\gamma^j(k)\Phi^{j}_\ell(k)s^j_\ell(k). \tag{5.2.14}$$

So, all processors asymptotically agree on a common value. Moreover, (5.2.10)
states that the limit depends by a non-negligible factor on the updates of any
computing processor. Finally, parts (iii) and (iv) of the Lemma quantify the
natural relationship between the frequency of inter-processor communications and
the speed at which agreement is reached.
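The agreement scenario (no further computation, continued communication and combining) can be simulated directly. The following sketch makes several assumptions that are not in the text: four processors with scalar states arranged in a ring, a fixed combining weight of 0.5, and a delay that cycles through 0, 1, 2; the function name `run_agreement` is hypothetical.

```python
def run_agreement(steps=2000, M=4, weight=0.5, max_delay=2):
    """Processors stop computing and only combine delayed messages on a ring."""
    history = [[float(i) for i in range(M)]]    # history[n][i] = x^i(n)
    for n in range(steps):
        current = history[-1]
        nxt = []
        for i in range(M):
            j = (i - 1) % M                      # i listens to its ring neighbour
            delay = min(n % (max_delay + 1), n)  # bounded, time-varying delay
            received = history[n - delay][j]     # value that j held at time n - delay
            # Convex combination of own state and the received message.
            nxt.append((1.0 - weight) * current[i] + weight * received)
        history.append(nxt)
    return history[0], history[-1]


initial, final = run_agreement()
spread = max(final) - min(final)
```

The final states lie inside the range of the initial states and their spread is negligible, which is the asymptotic agreement described above for this particular configuration.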

It turns out that the results to be derived later depend only on the fact
that Lemma 5.2.1 holds. For this reason we could, for example, remove Assumptions
5.2.3 and 5.2.4 and introduce inequality (5.2.12) as an independent Assumption.
This would add some more generality to our results: for example, inequality
(5.2.12) may be valid without the communication delays being bounded. Nevertheless,
we choose to retain Assumptions 5.2.1-5.2.4 because they are easy to enforce or
verify and are simpler to visualize.


We now continue with the development of the consequences of Lemma 5.2.1.
For any pair (i,j) of processors, we define a linear transformation $\Phi^{ij}(n|k)$
by

$$\Phi^{ij}(n|k)x=\left(\Phi^{ij}_1(n|k)x_1,\dots,\Phi^{ij}_L(n|k)x_L\right), \tag{5.2.15}$$

where $x=(x_1,\dots,x_L)\in H$. Note that if each $H_\ell$ is one-dimensional, then $\Phi^{ij}(n|k)$
may be represented by a diagonal matrix. If each $H_\ell$ is finite-dimensional,
$\Phi^{ij}(n|k)$ is a block-diagonal matrix, with the $\ell$-th block being a multiple of the
identity matrix of dimension equal to the dimension of $H_\ell$. Finally, in the
infinite dimensional case, $\Phi^{ij}(n|k)$ is a (bounded) linear operator, with very
simple structure. We norm bounded linear operators $\Phi: H\to H$, using the norm induced
by the norm of H. That is,

$$\|\Phi\|=\sup_{\|x\|\neq 0}\frac{\|\Phi x\|}{\|x\|}. \tag{5.2.16}$$

It is then a trivial consequence of Lemma 5.2.1 that $\lim_{n\to\infty}\Phi^{ij}(n|k)$ exists and is
independent of i; it will be denoted by $\Phi^{j}(k)$. In fact,

$$\|\Phi^{ij}(n|k)-\Phi^{j}(k)\|=\max_{\ell}\left|\Phi^{ij}_\ell(n|k)-\Phi^{j}_\ell(k)\right|, \tag{5.2.17}$$

which in conjunction with (5.2.11) or (5.2.12) provides bounds on the
convergence rate of $\Phi^{ij}(n|k)$.

We define a vector $y(n)\in H$ by

$$y(n)=\sum_{j=1}^{M}\Phi^{j}(0)x^j(1)+\sum_{k=1}^{n-1}\sum_{j=1}^{M}\gamma^j(k)\Phi^{j}(k)s^j(k). \tag{5.2.18}$$
This definition is consistent with (5.2.14). In fact, $y(n)=(y_1(n),\dots,y_L(n))$.
Notice that y(n) is recursively generated by

$$y(n+1)=y(n)+\sum_{j=1}^{M}\gamma^j(n)\Phi^{j}(n)s^j(n). \tag{5.2.19}$$

The vector y(n) is the element of H at which all processors would asymptotically
agree if they were to stop computing (but keep communicating and combining) at
a time n.

It may be viewed as a concise global summary of the state of computation at
time n, in contrast with the vectors $x^i(n)$ which are the local states of computation;
it allows us to look at the algorithm from two different levels: an aggregate and
a more detailed one. We will see later that this vector y(n) is also a very
convenient tool for proving convergence results. We have noted earlier that equation
(5.2.2) is a linear system. However, it is a fairly complicated one, whereas the
recursion (5.2.19) corresponds to a very simple linear system in standard state
space form. The content of the vector y(n) and of the $\Phi^{j}(n)$'s is easiest to
visualize in two special cases:

Specialization: (e.g. Examples I and III). Here y(n) takes each component from
the processor who specializes in that component. That is, $y(n)=(x^1_1(n),\dots,x^L_L(n))$.
Accordingly, $\Phi^{j}_\ell(n)=0$, for $\ell\neq j$, and $\Phi^{j}_j(n)=1$.

Example V: Here $\Phi^{ij}(n|k)$ is the ij-th entry of the matrix $\prod_{m=k+1}^{n-1}A(m)$. It follows
that the limit of $\Phi^{ij}(n|k)$ is the ij-th entry of $\bar{A}$, which by our assumptions
depends only on j. Moreover, y(n) equals any component of $\bar{A}x(n)$. (All
components are equal by our assumptions.) If we multiply both sides of (5.2.6)
by $\bar{A}$ and note that $\bar{A}A(n)=\bar{A}$, we obtain $y(n+1)=y(n)+\sum_{j=1}^{M}\bar{A}_{ij}s^j(n)$, which is
precisely (5.2.19).
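The step for Example V can be written out line by line. The only ingredient not stated explicitly in the text is the justification of $\bar{A}A(n)=\bar{A}$: if $A(n)=I$ it is immediate, and if $A(n)=A$ it follows from $\bar{A}A=\lim_{n\to\infty}A^{n+1}=\bar{A}$.

```latex
\begin{align*}
\bar{A}\,x(n+1) &= \bar{A}A(n)\,x(n) + \bar{A}\,s(n)
  && \text{(multiply (5.2.6) by } \bar{A}\text{)}\\
 &= \bar{A}\,x(n) + \bar{A}\,s(n)
  && \text{(since } \bar{A}A(n)=\bar{A}\text{)}.
\end{align*}
% Reading off any single row i (all rows of \bar{A} coincide):
\[
  y(n+1) = y(n) + \sum_{j=1}^{M} \bar{A}_{ij}\, s^{j}(n).
\]
```

Since $\gamma^j(n)=1$ in Example V and $\Phi^{j}(n)$ is the common value $\bar{A}_{ij}$, this agrees term by term with (5.2.19).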


We conclude this section with some more discussion of the linear system
(5.2.2). We first note that (5.2.2) is not a state space representation of a
linear system because of the presence of delays. In other words, the present
states of computation $\{x^i(n): i=1,\dots,M\}$ and the future inputs
$\{s^i(k): i=1,\dots,M;\ k\ge n\}$ are not sufficient to determine the future values
$x^i(k)$, $k>n$. A linear system with delays may always be cast in state space form,
by means of a standard state augmentation procedure. In this context, the
augmented state at time n should incorporate all messages that have been
transmitted but not yet received. Without any assumptions on communication delays,
this might require a state space of unbounded dimension. We note, however, a few
special cases.

1. A finite-dimensional state space representation is possible if each $H_\ell$
is finite dimensional and the total number of messages that have been transmitted
before time n and have not been received until time n is bounded by some constant
independent of n. The augmented state consists of all these messages, together
with the values of $x^i(n)$, for each i.

2. The boundedness condition above certainly holds if communication delays
are bounded by some $B_0$. In such a case, we might consider the augmented state

$$\left(x^1(n),\dots,x^M(n);\ \dots;\ x^1(n-B_0),\dots,x^M(n-B_0)\right).$$
3. Finally, if communication delays are zero, then $t^{ij}_\ell(n)=n$, whenever
$n\in T^{ij}_\ell$. In this case, (5.2.2) is already a linear system in standard state
space form. (Compare with Example V and equation (5.2.6).) The state vector
is $(x^1(n),\dots,x^M(n))\in H^M$ and we may obtain easily

$$x^i(n+1)=\sum_{j=1}^{M}\Phi^{ij}(n+1|n-1)x^j(n)+\gamma^i(n)s^i(n), \tag{5.2.20}$$

$$x^i(n+1)=\sum_{j=1}^{M}\Phi^{ij}(n+1|m-1)x^j(m)+\sum_{k=m}^{n}\sum_{j=1}^{M}\Phi^{ij}(n+1|k)\gamma^j(k)s^j(k),\quad m<n, \tag{5.2.21}$$

$$y(n)=\sum_{j=1}^{M}\Phi^{j}(n-1)x^j(n). \tag{5.2.22}$$

We  derive  below  a  few  inequalities  pertaining  to  the  zero-delay  case  which 
will  be  used  in  Section  5.4. 

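Lemma 5.2.2: Assume that communication delays are zero and that $\lim_{n\to\infty}\Phi^{ij}(n|k)$
exists, for all i,j,k, and is denoted by $\Phi^{j}(k)$. Then,

$$\text{(i)}\quad \sum_{j=1}^{M}\Phi^{ij}_\ell(k|n)=\sum_{j=1}^{M}\Phi^{j}_\ell(n)=1,\quad \forall i,k,n,\ell. \tag{5.2.23}$$

Let $x^1,\dots,x^M\in H$. Then,

$$\text{(ii)}\quad \max_i\Big\|x^i-\sum_{j=1}^{M}\Phi^{j}(n)x^j\Big\|\le\max_{i,j}\|x^i-x^j\|, \tag{5.2.24}$$

$$\text{(iii)}\quad \Big\|\sum_{j=1}^{M}\Phi^{j}(n)x^j\Big\|\le\max_i\|x^i\|,\quad \forall n, \tag{5.2.25}$$

$$\text{(iv)}\quad \max_i\Big\|\sum_{j=1}^{M}\Phi^{ij}(k|n)x^j\Big\|\le\max_i\|x^i\|,\quad \forall k,n, \tag{5.2.26}$$

$$\text{(v)}\quad \Big\|\sum_{j=1}^{M}\left(\Phi^{i_1j}(k|n)-\Phi^{i_2j}(k|n)\right)x^j\Big\|\le\max_{i,j}\|x^i-x^j\|,\quad \forall k,n,i_1,i_2, \tag{5.2.27}$$

$$\text{(vi)}\quad \Big\|\sum_{j=1}^{M}\left(\Phi^{i_1j}(k|n)-\Phi^{i_2j}(k|n)\right)x^j\Big\|\le 2M\max_{i,j}\|\Phi^{ij}(k|n)-\Phi^{j}(n)\|\,\max_{i,j}\|x^i-x^j\|,\quad \forall i_1,i_2,k,n. \tag{5.2.28}$$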

Proof:

(i) Suppose that, for all i, $x^i(k)=0$, $\forall k\le n$, that $\gamma^i(n)s^i(n)=z$ (for some
$z\in H$) and that $s^i(k)=0$, $\forall k\neq n$. It follows that $x^i(n+1)=z$, $\forall i$ and, by induction,
we obtain $x^i(k)=z$, $\forall k>n$, $\forall i$. Therefore, $\sum_{j=1}^{M}\Phi^{ij}(k|n)z=z$, $\forall k>n$, $\forall i$.
Since z was arbitrary, we obtain (5.2.23).

(ii) Using (5.2.23) and (5.2.8), it follows that $\sum_{j=1}^{M}\Phi^{j}_\ell(n)x^j_\ell$ belongs to the
convex hull of $\{x^1_\ell,\dots,x^M_\ell\}$, for any component $\ell$. Therefore,

$$\max_i\Big\|x^i_\ell-\sum_{j=1}^{M}\Phi^{j}_\ell(n)x^j_\ell\Big\|\le\max_{i,j}\|x^i_\ell-x^j_\ell\|. \tag{5.2.29}$$

We then take the maximum with respect to $\ell$ and recall that we are using the
max-norm to obtain (5.2.24).

(iii) By convexity, again

$$\Big\|\sum_{j=1}^{M}\Phi^{j}_\ell(n)x^j_\ell\Big\|\le\max_i\|x^i_\ell\|, \tag{5.2.30}$$

and the conclusion follows similarly, as in part (ii).

(iv) The proof is identical with that of part (iii).

(v) For any component $\ell$, note that $\sum_{j=1}^{M}\left[\Phi^{i_1j}_\ell(k|n)-\Phi^{i_2j}_\ell(k|n)\right]x^j_\ell$
is the difference between two elements in the convex hull of $\{x^1_\ell,\dots,x^M_\ell\}$.
Hence it is bounded by $\max_{i,j}\|x^i_\ell-x^j_\ell\|$. Taking the maximum over $\ell$, we recover
(5.2.27).

(vi) Note that, for any $i,i_3,k,n$,

$$\sum_{j=1}^{M}\Phi^{ij}(k|n)x^{i_3}=x^{i_3}, \tag{5.2.31}$$

due to (5.2.23). Hence, for any $i_3\in\{1,\dots,M\}$,

$$\begin{aligned}
\Big\|\sum_{j=1}^{M}\left[\Phi^{i_1j}(k|n)-\Phi^{i_2j}(k|n)\right]x^j\Big\|
&=\Big\|\sum_{j=1}^{M}\left[\Phi^{i_1j}(k|n)-\Phi^{i_2j}(k|n)\right]\left[x^j-x^{i_3}\right]\Big\|\\
&\le\sum_{j=1}^{M}\left(\|\Phi^{i_1j}(k|n)-\Phi^{j}(n)\|+\|\Phi^{j}(n)-\Phi^{i_2j}(k|n)\|\right)\|x^j-x^{i_3}\|\\
&\le 2M\max_{i,j}\|\Phi^{ij}(k|n)-\Phi^{j}(n)\|\,\max_{i,j}\|x^i-x^j\|.
\end{aligned} \tag{5.2.32}$$

The  model  of  computation  introduced  in  this  Section  may  be  generalized  in 
several  directions  [Bertsekas,  Tsitsiklis  and  Athans,  1984].  To  name  a  few 
examples,  the  updating  rules  of  each  processor  need  not  have  the  linear  structure 
of  equation  (5.2.2);  also,  it  may  be  convenient  to  communicate  other  information, 
besides  the  values  of  components  (e.g.  derivatives  of  the  cost  function;  see 
Section 5.6). In the most general case, we would expect that a model as general as
that of Section 2.1 might be needed. However, except for Section 5.6 (which treats
a special case), the present model is sufficient for our purposes.


There is a large number of well-known centralized deterministic and
stochastic optimization algorithms which have been analyzed using a variety
of analytical tools [Avriel, 1976; Poljak and Tsypkin, 1973; Ljung, 1977a;
Kushner and Clark, 1978; Poljak, 1976, 1977, 1978]. A large class of them, the
so-called "pseudo-gradient" algorithms [Poljak and Tsypkin, 1973] have the
distinguishing feature that the (expected) direction of update (conditioned upon the
past  history  of  the  algorithm)  is  a  descent  direction  with  respect  to  the  cost 
function  to  be  minimized.  Examples  I-IV  of  Section  5.2  certainly  have  such  a 
property.  A  larger  list  of  examples  is  provided  by  Poljak  and  Tsypkin  [1973] 
who  also  show  that  the  development  of  results  for  pseudo-gradient  algorithms 
leads  easily  to  results  for  broader  classes  of  algorithms,  such  as  Kiefer-Wolfowitz 
stochastic  approximation.  In  this  section  we  present  convergence  results  for  the 
natural  decentralized  asynchronous  versions  of  pseudo-gradient  algorithms.  We 
adopt  the  model  of  computation  and  the  corresponding  notation  of  Section  5.2. 

We allow the initialization $\{x^1(1),\dots,x^M(1)\}$ of the algorithm to be random
with finite mean and variance. We also allow the updates $s^i(n)$ and the step-size
$\gamma^i(n)$ of each processor, the combining coefficients $a^{ij}_\ell(n)$, the times of
transmission, reception and computation to be random. (So, $t^{ij}_\ell(n)$ and the sets
$T^{ij}_\ell$, $T^i_\ell$ are random.) We assume that all random variables of interest are defined
on some probability space $(\Omega,F,P)$.† We also introduce an increasing sequence $\{F_n\}$
of $\sigma$-fields contained in F, where $F_n$ is a $\sigma$-field describing the history of the
algorithm up to time n. In particular, we define $F_n$ as the smallest $\sigma$-field such
that $x^i(m)$, $\gamma^i(m)$, $m\le n$ and $s^i(m)$, $m<n$, are $F_n$-measurable, for each i, such that
events $n\in T^{ij}_\ell$, $n\in T^i_\ell$, belong to $F_n$, for each $i,j,\ell,n$ and such that $t^{ij}_\ell(n)$, $a^{ij}_\ell(n)$
are $F_n$-measurable, for any $n\in T^{ij}_\ell$, for any $i,j,\ell$.

† In the case where H is infinite dimensional we must be precise on the meaning of
measurability. We use the concept of strong measurability [Yosida, 1980]: a function
$x:\Omega\to H$ is called F-measurable if and only if there exists a sequence $\{x_n\}$ of finitely
valued functions $x_n:\Omega\to H$ such that $\lim_{n\to\infty}x_n(\omega)=x(\omega)$, for almost all $\omega\in\Omega$. (The limit
is with respect to the norm on H.) A similar definition applies concerning
measurability with respect to any other $\sigma$-field $F_n\subset F$.

Notice that in the above setting, $\Phi^{ij}_\ell(n|k)$ becomes a random variable,
determined by the random sequence of communication times, delays and combining
coefficients. Any interesting assumptions of a statistical nature on the
communicating and combining process have direct counterparts in terms of $\Phi^{ij}_\ell(n|k)$.

While we would like to allow as much randomness in the algorithm as possible,
certain restrictions have to be imposed. Namely, we have to avoid the possibility
that a processor chooses to transmit at those times that it receives a "bad
measurement" $s^i(n)$ or that it tends to give larger weights to bad measurements when
forming convex combinations. This may be avoided in two ways: either by not
allowing the processors to base their transmission decisions on the state of the
algorithm, or by somehow ensuring that the long run outcome is independent of
such decisions. This suggests the following assumption.

Assumption 5.3.1: (i) The limit $\Phi^{j}(k)$ of $\Phi^{ij}(n|k)$, as $n\to\infty$, exists and is
independent of i, with probability 1. (This assumption will be strengthened later.)

(ii) For any $i\in\{1,\dots,M\}$ and any m,n such that $m\le n$, $\Phi^{i}(m)$ and $s^i(n)$ are
conditionally independent, given $F_n$.

The  main  cases  in  which  this  assumption  holds  are  the  following: 

1. If interprocessor communications and the combining coefficients are modelled
as being deterministic. This does not mean that they have to be known in advance
but forbids the processors to decide whether to transmit by looking at the value
of a random update $s^i(n)$ or of their state vector $x^i(n)$.

2. More generally, if interprocessor communications and the combining
coefficients are random but are generated by an "exogenous" source.

3. If $s^i(n)$ is a deterministic function of $x^i(n)$. In such a case, $s^i(n)$,
conditioned upon $F_n$, is a constant, hence independent of $\Phi^{i}(k)$, for all k.

4. If $\Phi^{i}(k)$ is deterministic, for all k. This is much weaker than assuming that
$\Phi^{ij}(n|k)$ is deterministic, or that interprocessor communications are deterministic.
For example, in the specialization case, $\Phi^{i}(k)$ is guaranteed to be deterministic.

In terms of the Examples of Section 5.2, the assumption of deterministic
transmission and reception times may be relaxed, except for Example IV. This is
because $\Phi^{i}(k)$ is deterministic in Examples I, III, V, and $s^i(n)$ is a deterministic
function of $x^i(n)$ in Example II.

We assume that the objective of the algorithm is to minimize a nonnegative
cost function $J: H\to[0,\infty)$. For the time being, we only assume that J is a smooth
function. In particular, J is allowed to have several local minima.

Assumption 5.3.2: J is Frechet differentiable and its derivative satisfies the
Lipschitz condition

$$\|\nabla J(x)-\nabla J(x')\|\le K\|x-x'\|,\quad \forall x,x'\in H, \tag{5.3.1}$$

where K is some nonnegative constant.

Remark: Since we allow the possibility of infinite dimensional spaces, $\nabla J(x)$
should be viewed as an element of $H^*$, the dual of H. As usual $H^*$ is endowed with
the norm inherited from H. Similarly, we denote by $\nabla_\ell J(x)$ the derivative of J
with respect to $x_\ell$. This partial derivative belongs to $H^*_\ell$, the dual of $H_\ell$.
In the finite dimensional case, however, $\nabla J(x)$ and $\nabla_\ell J(x)$ may be just viewed as
vectors in H and $H_\ell$ respectively. In fact, with the notation we employ below,
they should be viewed as row vectors.

Assumption 5.3.3: The updates $s^i(n)$ of each processor satisfy

$$\nabla_\ell J(x^i(n))\,E[s^i_\ell(n)\,|\,F_n]\le 0,\quad \text{a.s.},\ \forall i,\ell,n. \tag{5.3.2}$$

This assumption states that each component of each processor's update is in
a descent direction, when conditioned on the past history of the algorithm, and
is satisfied by Examples I-IV of Section 5.2.
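As an illustration, condition (5.3.2) can be verified by hand for Example III. The computation below assumes the finite dimensional setting, in which $\nabla_i J$ is identified with $\partial J/\partial x_i$ written as a row vector, and uses the zero-mean property of the noise $w^i_i(n)$ given the past.

```latex
\begin{align*}
E\big[s^i_i(n)\,\big|\,F_n\big]
  &= -\nabla_i J(x^i(n))^{T} + E\big[w^i_i(n)\,\big|\,F_n\big]
   = -\nabla_i J(x^i(n))^{T},\\
\nabla_i J(x^i(n))\,E\big[s^i_i(n)\,\big|\,F_n\big]
  &= -\big\|\nabla_i J(x^i(n))\big\|^2 \;\le\; 0 .
\end{align*}
% For \ell \neq i the update s^i_\ell(n) is identically zero,
% so the left-hand side of (5.3.2) vanishes.
```

Here $\|\cdot\|$ denotes the Euclidean norm of the partial gradient; the same computation with $w^i_i(n)\equiv 0$ covers Example I.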

As an immediate consequence of (5.3.2) we obtain

$$E[\nabla J(x^i(n))\Phi^{i}(n)s^i(n)\,|\,F_n]=\sum_{\ell=1}^{L}E[\Phi^{i}_\ell(n)\,|\,F_n]\,\nabla_\ell J(x^i(n))\,E[s^i_\ell(n)\,|\,F_n]\le 0,\quad \text{a.s.} \tag{5.3.3}$$

It turns out that (5.3.3) is all that is required in our proofs. On the other
hand, it can be shown by means of simple examples that the condition

$$E[\nabla J(x^i(n))s^i(n)\,|\,F_n]\le 0,\quad \text{a.s.}$$

is not sufficient for proving convergence.

The next assumption is easily seen to hold for Examples I and II of Section
5.2. For stochastic algorithms, it requires that the variance of the updates
(and hence of any noise contained in them) goes to zero, as the gradient of the
cost function goes to zero.

Assumption 5.3.4: For some $K_0\ge 0$ and for all $i,\ell,n$,

$$E[\|s^i_\ell(n)\|^2]\le -K_0\,E[\nabla_\ell J(x^i(n))s^i_\ell(n)]. \tag{5.3.4}$$

As a matter of verifying Assumption 5.3.4, one would typically check the
validity of the slightly stronger condition

$$E[\|s^i_\ell(n)\|^2\,|\,F_n]\le -K_0\,\nabla_\ell J(x^i(n))\,E[s^i_\ell(n)\,|\,F_n].$$
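For Example I the stronger conditional inequality can be checked directly, under the assumption that the finite dimensional partial gradient $\nabla_i J$ is identified with a row vector and $\|\cdot\|$ is the Euclidean norm. The update $s^i_i(n)$ is a deterministic function of $x^i(n)$, so the conditional expectations drop out.

```latex
\begin{align*}
s^i_i(n) &= -\nabla_i J(x^i(n))^{T},\\
\|s^i_i(n)\|^2 &= \big\|\nabla_i J(x^i(n))\big\|^2
  = -\,\nabla_i J(x^i(n))\,s^i_i(n).
\end{align*}
% Hence the condition holds with equality for K_0 = 1;
% for \ell \neq i both sides are zero because s^i_\ell(n) = 0.
```

Taking expectations then gives (5.3.4) with $K_0=1$ for this example.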

Notice that, using Lemma 5.2.1 (ii) and Assumption 5.3.4 we obtain

$$E[\|s^i(n)\|^2]\le -\bar{K}_0\sum_{\ell=1}^{L}E[\nabla_\ell J(x^i(n))\Phi^{i}_\ell(n)s^i_\ell(n)]=-\bar{K}_0\,E[\nabla J(x^i(n))\Phi^{i}(n)s^i(n)], \tag{5.3.5}$$

where $\bar{K}_0=K_0/\eta\ge 0$. It turns out that (5.3.5) is all we need for our results to
hold.

Our  first  convergence  result  states  that  the  algorithm  converges  in  a  suitable 
sense,  provided  that  the  step-size  employed  by  each  processor  is  small  enough  and 
that  the  time  between  consecutive  communications  is  bounded,  and  applies  to 
Examples  I  and  II  of  Section  5.2.  It  should  be  noted,  however,  that  Theorem  5.3.1 
(as  well  as  Theorem  5.3.2  later)  does  not  prove  yet  convergence  to  a  minimum  or  a 
stationary point of J. In particular, there is nothing in our assumptions that
prohibits having $s^i(n)=0$, $\forall i,n$. Optimality is obtained later, using a few auxiliary
and fairly natural assumptions (see Corollary 5.3.1).


Theorem 5.3.1: Let Assumptions 5.2.1, 5.2.2, 5.2.4 hold, with probability 1,
and assume that the constants ($B_0$, $B_1$, $\alpha$) involved in them are deterministic. Let
also Assumptions 5.3.1-5.3.4 hold. Suppose that $\gamma^i(n)\ge 0$ and that $\sup_{i,n}\gamma^i(n)\le\gamma_0<\infty$.
(Here, $\gamma_0$ is deterministic.)

There exists a constant $\gamma^*>0$ (depending on the constants introduced in the
Assumptions) such that the inequality $0<\gamma_0<\gamma^*$ implies:

a) $J(x^i(n))$, $i=1,2,\dots,M$, as well as $J(y(n))$, converge almost surely, and
to the same limit.

b) $\lim_{n\to\infty}(x^i(n)-x^j(n))=\lim_{n\to\infty}(x^i(n)-y(n))=0$, $\forall i,j$, almost surely and
in the mean square.

c) The expression

$$\sum_{n=1}^{\infty}\sum_{i=1}^{M}\gamma^i(n)\,\nabla J(x^i(n))\,E[s^i(n)\,|\,F_n] \tag{5.3.6}$$

is finite, almost surely. Its expectation is also finite.

Leaving technical issues aside, the idea behind the proof of the above
(and the next) Theorem is rather simple: the difference between $y(n)$ and $x^i(n)$,
for any $i$, is of the order of $B\gamma_0$, where $B$ is proportional to a bound on
communication delays plus the time between consecutive communications between
processors. Therefore, as long as $\gamma_0$ remains small, $\nabla J(x^i(n))$ is approximately
equal to $\nabla J(y(n))$; hence, $s^i(n)$ (and consequently $\Phi^i(n)s^i(n)$) is approximately
in a descent direction, starting from point $y(n)$. Therefore, iteration (5.2.19)
is approximately the same as a centralized descent (pseudo-gradient) algorithm
which is, in general, convergent [Poljak and Tsypkin, 1973].
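The intuition above can be tried out numerically. The following Python sketch is not taken from the report: it assumes a particular quadratic cost $J(x) = \frac12 x^T Q x - b^T x$, two processors that each update one component from a local (possibly outdated) copy of the full vector, and a fixed communication delay. The names `simulate`, `gamma` and `delay` are hypothetical.

```python
from collections import deque

def simulate(gamma=0.1, delay=3, steps=2000):
    """Two processors, each updating one component with a delayed view of the other."""
    # Assumed cost: J(x) = 0.5 x^T Q x - b^T x, minimized at x* = (1/3, 1/3).
    Q = [[2.0, 1.0], [1.0, 2.0]]
    b = [1.0, 1.0]
    # x[i] is processor i's local copy of the full vector; all copies start at 0.
    x = [[0.0, 0.0], [0.0, 0.0]]
    # channel[i] holds values of component i still in transit to the other processor.
    channel = [deque([0.0] * delay), deque([0.0] * delay)]
    for _ in range(steps):
        # Each processor takes a gradient step on its own component only,
        # evaluating the partial derivative at its own local copy x[i].
        new_own = []
        for i in range(2):
            grad_i = Q[i][0] * x[i][0] + Q[i][1] * x[i][1] - b[i]
            new_own.append(x[i][i] - gamma * grad_i)
        for i in range(2):
            x[i][i] = new_own[i]
            channel[i].append(new_own[i])      # send the fresh value
        for i in range(2):
            j = 1 - i
            x[i][j] = channel[j].popleft()     # receive a value sent `delay` steps ago
    return x

x = simulate()
```

With the small constant step-size used here the two local copies end up agreeing with each other and with the minimizer, despite the delay. In this particular example $Q$ is diagonally dominant, which makes the iteration tolerant of any bounded delay; that is a property of the assumed example, not a general claim.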


Remark on Notation: In the course of the proofs in this section, we will use
the symbol $A$ to denote non-negative constants which are independent of $n$, $\gamma_0$,
$\gamma^i(n)$, $x^i(n)$, $s^i(n)$ etc., but which may depend on the constants introduced in the
various assumptions (that is, $M$, $L$, $K$, $K_0$, $B_0$, $B_1$, $\alpha$, etc.). When $A$ appears in
different expressions, or even in different sides of the same equality (or
inequality), it will not necessarily represent the same constant. (With this
convention, an inequality of the form $A + 1 \le A$ is meaningful and has to be
interpreted as saying that $A+1$, where $A$ is some constant, is smaller than some other
constant, denoted again by $A$.) This convention is followed so as to avoid the
introduction of unnecessarily many symbols.


Proof of Theorem 5.3.1: We introduce a new increasing sequence $\{F_n^0\}$ of
$\sigma$-fields contained in $F$. In particular, we let $F_n^0$ be the smallest $\sigma$-field
containing $F_n$ and such that $\Phi^i(k)$ is $F_n^0$-measurable, for all $k \le n$. Equation
(5.2.18) then implies that $y(n)$ (and, therefore, $J(y(n))$) is $F_n^0$-measurable.
Using the fact that $\Phi_\ell^i(n)$ is $F_n^0$-measurable, we obtain

$$E\big[\nabla_\ell J(x^i(n))\, \Phi_\ell^i(n)\, s_\ell^i(n) \mid F_n^0\big] = \Phi_\ell^i(n)\, \nabla_\ell J(x^i(n))\, E\big[s_\ell^i(n) \mid F_n^0\big] = \Phi_\ell^i(n)\, \nabla_\ell J(x^i(n))\, E\big[s_\ell^i(n) \mid F_n\big], \tag{5.3.6a}$$

where the last equality follows from Assumption 5.3.1. Using Assumption 5.3.3
we conclude that

$$E\big[\nabla J(x^i(n))\, \Phi^i(n)\, s^i(n) \mid F_n^0\big] \le 0, \quad \text{a.s.} \tag{5.3.7}$$


Without loss of generality, we will assume that the algorithm is initialized
so that $x^i(1)=0$, $\forall i$. In the general case where $x^i(1) \ne 0$, we may think of the
algorithm as having started at time 0, with $x^i(0)=0$; then, a random update
$s^i(0)$ sets $x^i(1)$ to a nonzero value. So, the case in which the processors initially
disagree may be easily reduced to the case where they initially agree.

Note that we may define $\bar s^i(n) = \frac{\gamma^i(n)}{\gamma_0}\, s^i(n)$ and view $\bar s^i(n)$ as the new
step with step-size $\gamma_0$. It is easy to see that Assumptions 5.3.3 and 5.3.4 also
hold for $\bar s^i(n)$. For these reasons, no generality is lost if we assume that
$\gamma^i(n) = \gamma_0$, $\forall n$, and this is what we will do.

Let us define

$$b(n) = \sum_{i=1}^{M} \|s^i(n)\| \tag{5.3.8}$$

and note that

$$b^2(n) \le M \sum_{i=1}^{M} \|s^i(n)\|^2 \le M\, b^2(n).$$

Using (5.2.7), (5.2.18) and Lemma 5.2.1 (iii), we obtain

$$\|y(n) - x^i(n)\| \le \sum_{k=1}^{n-1} \sum_{j=1}^{M} \gamma_0\, \|\Phi^j(k) - \Phi^{ij}(n|k)\|\, \|s^j(k)\| \le A\gamma_0 \sum_{k=1}^{n-1} d^{\,n-k}\, b(k). \tag{5.3.9}$$


From a Taylor series expansion for $J$ we obtain

$$J(y(n+1)) = J\Big(y(n) + \gamma_0 \sum_{i=1}^{M} \Phi^i(n) s^i(n)\Big) \le J(y(n)) + \gamma_0 \nabla J(y(n)) \sum_{i=1}^{M} \Phi^i(n) s^i(n) + A \Big\|\gamma_0 \sum_{i=1}^{M} \Phi^i(n) s^i(n)\Big\|^2$$

$$\le J(y(n)) + \gamma_0 \nabla J(y(n)) \sum_{i=1}^{M} \Phi^i(n) s^i(n) + A \gamma_0^2\, b^2(n). \tag{5.3.10}$$


Assumption 5.3.3 is in terms of $\nabla J(x^i(n))$, whereas above we have $\nabla J(y(n))$. To
overcome this difficulty, we use the Lipschitz continuity of the derivative of $J$
and invoke (5.3.9) to obtain

$$\Big\| \nabla J(y(n)) \sum_{i=1}^{M} \Phi^i(n) s^i(n) - \sum_{i=1}^{M} \nabla J(x^i(n))\, \Phi^i(n) s^i(n) \Big\| \le A \sum_{i=1}^{M} \|y(n) - x^i(n)\|\, \|s^i(n)\|$$

$$\le \gamma_0 A \sum_{k=1}^{n-1} d^{\,n-k} b(k) \sum_{i=1}^{M} \|s^i(n)\| = \gamma_0 A \sum_{k=1}^{n-1} d^{\,n-k} b(k)\, b(n) \le \gamma_0 A \sum_{k=1}^{n-1} d^{\,n-k} \big[b^2(k) + b^2(n)\big]. \tag{5.3.11}$$

Let us define

$$G^i(n) = -\nabla J(x^i(n))\, \Phi^i(n)\, s^i(n), \tag{5.3.12}$$

$$G(n) = \sum_{i=1}^{M} G^i(n), \tag{5.3.13}$$

and note that (5.3.3) implies that $E[G(n)] \ge 0$. We now rewrite inequality
(5.3.19)

By Assumption 5.3.3 and (5.3.6a), $E[G(k) \mid F_k] \ge 0$, $\forall k$; we may apply the monotone
convergence theorem to (5.3.19) and obtain

(5.3.20)

(5.3.21)

Now use the fact (Lemma 5.2.1 (ii), inequality (5.2.10)) that $\Phi_\ell^i(n) \ge \eta > 0$, for
any computing processor $i$ for component $\ell$. This implies that

$$\sum_{n=1}^{\infty} \nabla_\ell J(x^i(n))\, E\big[s_\ell^i(n) \mid F_n\big] > -\infty, \quad \text{a.s.}$$

and establishes part (c) of the theorem.


Lemma 5.3.1: Let $X(n)$, $Z(n)$ be non-negative stochastic processes (with finite
expectation) adapted to $\{F_n^0\}$ and such that

$$E\big[X(n+1) \mid F_n^0\big] \le X(n) + Z(n), \tag{5.3.22}$$

$$\sum_{n=1}^{\infty} E[Z(n)] < \infty. \tag{5.3.23}$$

Then $X(n)$ converges almost surely, as $n \to \infty$.

Proof of Lemma 5.3.1: By the monotone convergence theorem and (5.3.23) it
follows that $\sum_{n=1}^{\infty} Z(n) < \infty$, almost surely. Then, Lemma 5.3.1 becomes the same
as Lemma 4.C.1 in [Ljung and Soderstrom, 1983, p.453]. □

Now let $A$ be the constant in the right hand side of (5.3.14) and let

Z(n)

(5.3.24)

Then, $Z(n) \ge 0$ and by (5.3.15)

$$E[Z(n)] \le A \sum_{k=1}^{n} d^{\,n-k} E[G(k)]. \tag{5.3.25}$$

Therefore,

$$\sum_{n=1}^{\infty} E[Z(n)] \le A \sum_{n=1}^{\infty} \sum_{k=1}^{n} d^{\,n-k} E[G(k)] = \frac{A}{1-d} \sum_{k=1}^{\infty} E[G(k)] < \infty, \tag{5.3.26}$$

where the last inequality follows from (5.3.19). Therefore, $Z(n)$ satisfies
(5.3.23). We take the conditional expectation of (5.3.14), given $F_n^0$. Note that
$J(y(n))$ is $F_n^0$-measurable and that $E[G(n) \mid F_n^0] \ge 0$. Therefore Lemma 5.3.1 applies
and $J(y(n))$ converges almost surely.

Using Assumption 5.3.4 once more, together with (5.3.19),

$$E\Big[\sum_{k=1}^{\infty} b^2(k)\Big] \le A \sum_{k=1}^{\infty} E[G(k)] < \infty, \tag{5.3.27}$$

which implies that $b(k)$ converges to zero, almost surely. Recall (5.3.9)
to conclude that $y(n) - x^i(n)$ converges to zero, almost surely. Also, by
squaring (5.3.9), taking expectations and using the fact that $E[b^2(k)]$
converges to zero, we conclude that $E[\|y(n) - x^i(n)\|^2]$ also converges to zero,
and this proves part (b) of the Theorem.

Lemma 5.3.2: Under Assumption 5.3.2, there exists some $A > 0$ such that

$$\|\nabla J(x)\|^2 \le A\, J(x), \quad \forall x \in H. \tag{5.3.28}$$

Proof: By the definition of the induced norm on $H^*$,

$$\|\nabla J(x)\| = \sup_{y \ne 0} \frac{\nabla J(x)\, y}{\|y\|}.$$

Therefore, for any $\epsilon > 0$, there exists some $y \in H$ such that
$\nabla J(x)\, y \ge (1-\epsilon)\, \|\nabla J(x)\|\, \|y\|$. Moreover $y$ can be scaled so that
$\|y\| = \frac{1}{K} \|\nabla J(x)\|$ ($K$ denoting the Lipschitz constant of $\nabla J$). Then, a second order expansion for $J$ yields

$$J(x-y) \le J(x) - \frac{1-\epsilon}{K} \|\nabla J(x)\|^2 + \frac{1}{2K} \|\nabla J(x)\|^2. \tag{5.3.29}$$

Taking $\epsilon < \frac12$ and using the assumption that $J$ is a nonnegative function,
(5.3.28) follows. □
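As a sanity check of the kind of bound stated in Lemma 5.3.2, the sketch below tests it on an assumed example that does not appear in the report: a nonnegative diagonal quadratic $J(x) = \frac12 \sum_i q_i x_i^2$, whose gradient is Lipschitz with constant $K = \max_i q_i$. For this example the bound holds with $A = 2K$. The function name `max_ratio` is hypothetical.

```python
import random

def max_ratio(q, trials=1000, seed=0):
    """Largest observed ||grad J(x)||^2 / J(x) for J(x) = 0.5 * sum(q_i * x_i^2)."""
    rng = random.Random(seed)
    worst = 0.0
    for _ in range(trials):
        x = [rng.uniform(-10.0, 10.0) for _ in q]
        cost = 0.5 * sum(qi * xi * xi for qi, xi in zip(q, x))
        grad_sq = sum((qi * xi) ** 2 for qi, xi in zip(q, x))
        if cost > 0.0:
            worst = max(worst, grad_sq / cost)
    return worst

q = [0.5, 1.0, 4.0]      # assumed curvatures; K = max(q) = 4
ratio = max_ratio(q)     # never exceeds 2 * K for this family of costs
```

The nonnegativity of $J$ is essential: adding a negative constant to $J$ leaves the gradient unchanged but makes the ratio unbounded near the minimizer.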

Since $J(y(n))$ converges, it is bounded; hence $\nabla J(y(n))$ is also bounded, by
(5.3.28). We then use the fact that $y(n) - x^i(n)$ converges to zero, to conclude
that $J(x^i(n)) - J(y(n))$ also converges to zero. This proves part (a) and concludes
the proof of the Theorem. □


Remark: The choice of $\gamma^*$ in our proof is not tight. A tighter value may
be obtained in terms of the data of the problem (that is, in terms of $M$, $L$, $K$, $K_0$,
etc.) by tracing back the inequalities which led to (5.3.17). Finally, the
conclusions of the Theorem remain true, with the same $\gamma^*$, even if we replace
the condition $\gamma_0 \le \gamma^*$ by the weaker requirement $\limsup_{n\to\infty} \gamma^i(n) \le \gamma^*$.

Decreasing  Step-Size  Algorithms 

We now introduce a different set of assumptions. We allow the magnitude
of the updates $s^i(n)$ to remain nonzero, even if $\nabla J(x^i(n))$ is zero (Examples
III and IV of Section 5.2). Such situations are common in stochastic
approximation algorithms or in system identification applications. Since the
noise is persistent, the algorithm can be made convergent only by letting the
step-size $\gamma^i(n)$ decrease to zero. The choice $\gamma^i(n) = 1/n$ is most commonly used
and in the sequel we will assume that $\gamma^i(n)$ decreases at least as fast as $1/n$.
Note that even if each processor selects its step-size according to a local clock
or counter (which may be itself random), as long as these clocks do not operate
in different time scales, we may assume that $\gamma^i(n) \le A/n$, for all $i$ and for some
$A > 0$.
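The need for a decreasing step-size under persistent noise can be seen in a scalar toy model, which is an assumption of this sketch and not one of the report's examples: $x(n+1) = x(n) - \gamma(n)\,(x(n) + w(n))$ with $x(1)=0$, where $x(n)$ is the gradient of $J(x)=\frac12 x^2$ and $w(n)$ is zero-mean noise with variance $\sigma^2$, independent over time. The variance $v(n)$ of $x(n)$ then obeys $v(n+1) = (1-\gamma(n))^2 v(n) + \gamma(n)^2 \sigma^2$, which the hypothetical helper `error_variance` iterates exactly, with no random sampling.

```python
def error_variance(step_size, num_steps, noise_var=1.0):
    """Variance of x after num_steps updates of x <- x - gamma(n) * (x + w(n)), x(1) = 0."""
    v = 0.0
    for n in range(1, num_steps + 1):
        g = step_size(n)
        # Exact variance recursion for independent zero-mean noise.
        v = (1.0 - g) ** 2 * v + g ** 2 * noise_var
    return v

v_decreasing = error_variance(lambda n: 1.0 / n, 1000)   # equals noise_var / 1000
v_constant = error_variance(lambda n: 0.1, 1000)         # settles near 0.1 / 1.9
```

With $\gamma(n) = 1/n$ the iterate is minus the running average of the noise, so its variance decays like $\sigma^2/n$; with a constant step-size $\gamma$ the variance settles at the nonzero level $\gamma\sigma^2/(2-\gamma)$.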

Since the step-size is decreasing, the algorithm becomes progressively
slower as $n \to \infty$. This allows us to let the communications process become
progressively slower as well, provided that it remains fast enough, when
compared with the natural time scale of the algorithm, the latter being
determined by the rate of decrease of the step-size. Such a possibility is captured
by Assumption 5.2.3.
The following assumption, intended to replace Assumption 5.3.4, allows
the noise to be persistent. Similarly with Assumption 5.3.4, it could be
more naturally stated in terms of conditional expectations, but such a
stronger version turns out to be unnecessary.

Assumption 5.3.5: For some $K_0, K_1, K_2 \ge 0$,

$$E\big[\|s_\ell^i(n)\|^2\big] \le -K_0\, E\big[\nabla_\ell J(x^i(n))\, s_\ell^i(n)\big] + K_1 E\big[J(x^i(n))\big] + K_2. \tag{5.3.30}$$

In fact we will only use the following immediate consequence of
(5.3.30) and Lemma 5.2.1 (iii). (Compare with inequality (5.3.5).)

$$E\big[\|s^i(n)\|^2\big] \le -\bar K_0\, E\big[\nabla J(x^i(n))\, \Phi^i(n)\, s^i(n)\big] + K_1 E\big[J(x^i(n))\big] + K_2. \tag{5.3.31}$$

Theorem 5.3.2: Let Assumptions 5.2.1, 5.2.3, 5.2.4 hold, with probability 1,
and assume that the constants $B_0$, $B_1$, $\alpha$, $\beta$ involved in them are deterministic.
Let also Assumptions 5.3.1, 5.3.2, 5.3.3, 5.3.5 hold. Assume that, for some
$K_3 > 0$, $\gamma^i(n) \le K_3/n$, $\forall n,i$. Then,

a) $J(x^i(n))$, $i=1,2,\ldots,M$, as well as $J(y(n))$, converge almost
surely, and to the same limit.

b) $\lim_{n\to\infty} (x^i(n) - x^j(n)) = \lim_{n\to\infty} (x^i(n) - y(n)) = 0$, $\forall i,j$,
almost surely and in the mean square.

c) The expression

$$\sum_{n=1}^{\infty} \sum_{i=1}^{M} \gamma^i(n)\, \nabla J(x^i(n))\, E\big[s^i(n) \mid F_n\big] \tag{5.3.32}$$

is finite, almost surely. Its expectation is also finite.


For the proof of Theorem 5.3.2, we will start with an auxiliary
Lemma. In this Lemma we will bound certain infinite series by corresponding
infinite integrals. This is justified as long as the integrand cannot change
by more than a constant factor between any two consecutive integer points.
For notational convenience, we use $c(n|k)$ to denote $d^{\,n^\delta - k^\delta}$, where $d$ and $\delta$ are
as in Lemma 5.2.1 (iv).


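Lemma 5.3.3: The following hold:

(i)
$$\sum_{k=1}^{\infty} \sum_{n=k}^{\infty} \frac{1}{nk}\, c(n|k) < \infty. \tag{5.3.33}$$

(ii) Let $\alpha \ge 0$. Then, there exists some $A > 0$ such that
$$\sum_{k=1}^{n} \frac{n}{k^{2+\alpha}}\, c(n|k) \le A\, n^{-\delta-\alpha}, \quad \forall n \ge 1. \tag{5.3.34}$$

(iii)
$$\lim_{n\to\infty} \sum_{k=1}^{n} \frac{n}{k^2}\, c(n|k) = \lim_{n\to\infty} \sum_{k=1}^{n} \frac{1}{k}\, c(n|k) = 0. \tag{5.3.35}$$

(iv)
$$\lim_{k\to\infty} \sum_{n=k}^{\infty} \frac{1}{n}\, c(n|k) = 0. \tag{5.3.36}$$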

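Proof of Lemma 5.3.3: Let $t^\delta = y$; then $t = y^{1/\delta}$ and $dt = (1/\delta)\, y^{(1/\delta)-1}\, dy$. Notice
that the left hand side of (5.3.34) is bounded by

$$A n \int_1^{n} t^{-(2+\alpha)}\, d^{\,n^\delta - t^\delta}\, dt = \frac{A n}{\delta} \int_1^{n^\delta} y^{-(2+\alpha)/\delta}\, y^{(1/\delta)-1}\, d^{\,n^\delta - y}\, dy = A n \int_1^{n^\delta} y^{-(1+\alpha)/\delta - 1}\, d^{\,n^\delta - y}\, dy \le A\, n^{-\alpha-\delta},$$

which proves part (ii); part (iii) follows immediately, by letting $\alpha = 0$ in
(5.3.34). For part (i), we notice that the left hand side of (5.3.33) is
bounded by

$$A \sum_{n=1}^{\infty} \sum_{k=1}^{n} \frac{1}{k^2}\, c(n|k) \le A \sum_{n=1}^{\infty} n^{-1-\delta} < \infty,$$

where the first inequality follows from (5.3.34) with $\alpha = 0$. Finally, part (iv)
is an immediate consequence of part (i). □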
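The bound in part (ii) can be probed numerically. The sketch below reads (5.3.34) as $\sum_{k=1}^{n} \frac{n}{k^{2+\alpha}}\, c(n|k) \le A\, n^{-\delta-\alpha}$ with $c(n|k) = d^{\,n^\delta - k^\delta}$; that reading, the parameter values $d = 0.5$, $\delta = 0.5$, and the function names are assumptions made for illustration. It evaluates the left hand side multiplied by $n^{\delta+\alpha}$, which should stay bounded as $n$ grows.

```python
def c(n, k, d, delta):
    """Kernel c(n|k) = d ** (n**delta - k**delta), with 0 < d < 1 and 0 < delta <= 1."""
    return d ** (n ** delta - k ** delta)

def scaled_sum(n, d=0.5, delta=0.5, alpha=0.0):
    """Left hand side of the assumed form of (5.3.34), multiplied by n**(delta + alpha)."""
    total = sum(n / k ** (2 + alpha) * c(n, k, d, delta) for k in range(1, n + 1))
    return total * n ** (delta + alpha)

# A bounded sequence of values is consistent with the claimed rate.
values = [scaled_sum(n) for n in (1, 10, 100, 1000, 5000)]
```

A finite table of values cannot prove the bound; it only shows that the assumed exponents are plausible for these parameters.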


Proof of Theorem 5.3.2: Using the same arguments as in the proof of Theorem
5.3.1, we may assume, without loss of generality, that $x^i(1)=0$ and that
$\gamma^i(n) = 1/n$, $\forall i,n$. (Otherwise we could define $\bar s^i(n) = n\gamma^i(n)\, s^i(n)$.) Moreover,
we define the $\sigma$-field $F_n^0$ as in that proof and note that (5.3.7) is again valid.
We still use $c(n|k)$ to denote $d^{\,n^\delta - k^\delta}$. We define again $b(n)$, $G^i(n)$, $G(n)$
by (5.3.8), (5.3.12), (5.3.13), respectively, as in the proof of Theorem 5.3.1.
Also, let

$$\phi(n|k) = \begin{cases} \dfrac{1}{n^2} + \displaystyle\sum_{m=1}^{n-1} \dfrac{1}{nm}\, c(n|m), & n = k, \\[2ex] \dfrac{1}{nk}\, c(n|k), & k < n. \end{cases} \tag{5.3.37}$$


By replicating the steps leading to inequality (5.3.14) in the proof of
Theorem 5.3.1 and using the definition (5.3.37) we obtain, for some $A > 0$,

$$J(y(n+1)) \le J(y(n)) - \frac{1}{n}\, G(n) + A \sum_{k=1}^{n} \phi(n|k)\, b^2(k), \quad \forall n. \tag{5.3.38}$$

Taking expectations in (5.3.38), we have

$$E[J(y(n+1))] \le E[J(y(n))] - \frac{1}{n}\, E[G(n)] + A \sum_{k=1}^{n} \phi(n|k)\, E[b^2(k)], \quad \forall n. \tag{5.3.39}$$

We would like to use Assumption 5.3.5 to bound $E[b^2(k)]$. However, these
bounds are in terms of $E[J(x^i(n))]$, while (5.3.39) involves $E[J(y(n))]$.
Nevertheless, we have:


Lemma 5.3.4: There is a constant $A > 0$ such that, for all $n \ge 1$,

$$E[b^2(n)] \le A \sum_{k=1}^{n-1} \frac{n}{k^2}\, c(n|k)\, E[b^2(k)] + A\, E[G(n)] + A\, E[J(y(n))] + A. \tag{5.3.40}$$

Proof of Lemma 5.3.4: Assumption 5.3.5 may be written as

$$E[b^2(n)] \le A\, E[J(y(n))] + A\, E[G(n)] + A + A \sum_{i=1}^{M} E\big[J(x^i(n)) - J(y(n))\big], \tag{5.3.41}$$

and we need to bound the last term.

Using Lemma 5.3.2 and a second order series expansion for $J$ we obtain:

$$J(x^i(n)) - J(y(n)) \le \|\nabla J(y(n))\|\, \|x^i(n) - y(n)\| + A \|x^i(n) - y(n)\|^2$$

$$\le \tfrac12 \|\nabla J(y(n))\|^2 + A \|x^i(n) - y(n)\|^2 \le A\, J(y(n)) + A \|x^i(n) - y(n)\|^2. \tag{5.3.42}$$

Similarly with (5.3.9), we have

$$\|y(n) - x^i(n)\| \le A \sum_{k=1}^{n-1} \frac{1}{k}\, c(n|k)\, b(k). \tag{5.3.43}$$

We now use (5.3.43) to bound the last term in (5.3.42). We obtain

$$\|x^i(n) - y(n)\|^2 \le \Big[A \sum_{k=1}^{n-1} \frac{1}{k}\, c(n|k)\, b(k)\Big]^2 \le A n \sum_{k=1}^{n-1} \frac{1}{k^2}\, c^2(n|k)\, b^2(k) \le A n \sum_{k=1}^{n-1} \frac{1}{k^2}\, c(n|k)\, b^2(k). \tag{5.3.44}$$

We now take expectations in (5.3.42) and use (5.3.44) to recover (5.3.40).
This completes the proof of Lemma 5.3.4. □

Inequality (5.3.40) bounds $b^2(n)$ in terms of past values of $b^2(k)$. We
may recursively eliminate such past values and obtain:

Lemma 5.3.5: There exists a finite constant $A$ and a kernel $g(n,k)$ such that

$$E[b^2(n)] \le A \sum_{k=1}^{n} g(n,k)\, \big[E[G(k)] + E[J(y(k))] + 1\big], \tag{5.3.45}$$

$$\sum_{k=1}^{n} g(n,k) \le A, \quad \forall n, \tag{5.3.46}$$

$$\sum_{n=k}^{\infty} g(n,k) \le A, \quad \forall k. \tag{5.3.47}$$


Proof of Lemma 5.3.5: Consider the linear (time-varying) system

$$X_n = A \sum_{k=1}^{n-1} \frac{n}{k^2}\, c(n|k)\, X_k + U_n, \tag{5.3.48}$$

where $A$ is the constant in (5.3.40). By linearity, there exists a kernel
$g(n,k)$ such that

$$X_n = \sum_{k=1}^{n} g(n,k)\, U_k. \tag{5.3.49}$$

If we let

$$U_n = A \big[E[G(n)] + E[J(y(n))] + 1\big] \tag{5.3.50}$$

and use (5.3.40) we obtain (by an easy induction) $E[b^2(n)] \le X_n$, $\forall n$, which
is essentially (5.3.45).

In order to prove (5.3.46), we must show that the step response of
(5.3.48) is bounded. That is, we assume $U_n = 1$, $\forall n$, and we must show that the
resulting sequence $\{X_n\}$ is bounded. Let us define a sequence $\{B_n\}$ by $B_1 = 1$ and,
for $n > 1$,

$$B_n = \max\Big\{ B_{n-1},\ A \sum_{k=1}^{n-1} \frac{n}{k^2}\, c(n|k)\, B_{n-1} + 1 \Big\}. \tag{5.3.51}$$

It follows from (5.3.48), with $U_n = 1$, that

$$B_n \ge \max_{k \le n} X_k. \tag{5.3.52}$$

So, it is sufficient to show that $\{B_n\}$ is a bounded sequence. But this follows
easily from (5.3.51) and (5.3.34), with $\alpha = 0$.


Now, in order to show that (5.3.47) holds, we need to show that the
response of the system (5.3.48) to an impulse at time $k_0$ is summable and
that the sum is bounded by a constant which is independent of $k_0$. So, we
assume that $U_{k_0} = 1$, $U_k = 0$, $\forall k \ne k_0$, and we need to show

$$\sum_{n=k_0}^{\infty} X_n \le A, \tag{5.3.53}$$

where $A$ does not depend on $k_0$. Since the step response of (5.3.48) has been
shown to be bounded, the impulse responses must be uniformly bounded; that is,

$$X_n \le A, \quad \forall n \ge k_0, \tag{5.3.54}$$

where $X_n$ is the response to an impulse at time $k_0$ and $A$ is independent of $k_0$.

We will show by induction that for every $\epsilon > 0$ and every integer $m \ge 0$, there exists
a constant $A$, independent of $k_0$, such that

$$X_n \le A\, n^{-m\delta} + A\, \frac{n^{1+\epsilon m}}{k_0^2}\, c(n|k_0), \quad \forall n \ge k_0. \tag{5.3.55}$$

Let us fix some $\epsilon > 0$. Clearly, (5.3.55) is satisfied for $m = 0$, due to (5.3.54).
Assume it is true for some arbitrary integer $m \ge 0$. We then use (5.3.55) in
(5.3.48) (recall that $U_{k_0} = 1$, $U_k = 0$, $\forall k \ne k_0$ and $X_{k_0} = 1$, $X_k = 0$, $\forall k < k_0$) to obtain,
for $n > k_0$:

$$X_n \le A\, \frac{n}{k_0^2}\, c(n|k_0) + A \sum_{k=k_0+1}^{n-1} \frac{n}{k^2}\, c(n|k) \Big[ A\, k^{-m\delta} + A\, \frac{k^{1+\epsilon m}}{k_0^2}\, c(k|k_0) \Big].$$
Using (5.3.34), with $\alpha = m\delta$,

$$\sum_{k=k_0+1}^{n-1} \frac{n}{k^2}\, c(n|k)\, k^{-m\delta} \le A\, n^{-(m+1)\delta},$$

where $A$ is independent of $k_0$. Also note that $c(n|k)\, c(k|k_0) = c(n|k_0)$ and that

$$\sum_{k=k_0+1}^{n-1} \frac{n}{k^2}\, c(n|k)\, \frac{k^{1+\epsilon m}}{k_0^2}\, c(k|k_0) \le \frac{n^{1+\epsilon m}}{k_0^2}\, c(n|k_0) \sum_{k=k_0+1}^{n-1} \frac{1}{k} \le A\, \frac{n^{1+\epsilon(m+1)}}{k_0^2}\, c(n|k_0), \tag{5.3.56}$$

where $A$ depends on $\epsilon$ but not on $k_0$. (The last inequality follows from the
fact that for any $\epsilon > 0$, there exist $A$, $A_1$ such that $\sum_{k=1}^{n} \frac{1}{k} \le A \log n \le A_1 n^{\epsilon}$.)
This shows that (5.3.55) is also valid for $m+1$, and completes the induction.

Taking $m$ large enough so that $m\delta > 1$, the term $A n^{-m\delta}$ in (5.3.55) becomes
summable, and the sum is independent of $k_0$. It remains to show the summability
of the last term in (5.3.55). Proceeding as in the proof of Lemma 5.3.3 and
letting $\epsilon_0 = m\epsilon$, we have:

(5.3.57)

Since $\epsilon$ was arbitrary, it may be chosen small enough so that $\epsilon_0 = m\epsilon < \delta$. Then,
the right hand side of (5.3.57) converges to zero, as $k_0 \to \infty$, and (a fortiori)
it is bounded by a constant independent of $k_0$. This concludes the proof of
Lemma 5.3.5. □


We may now use the bounds given by Lemma 5.3.5 in (5.3.39) to obtain,
after some rearrangement,

$$E[J(y(n+1))] \le E[J(y(n))] - \frac{1}{n}\, E[G(n)] + A \sum_{m=1}^{n} q(n,m)\, \big[E[G(m)] + E[J(y(m))] + 1\big], \tag{5.3.58}$$

where

$$q(n,m) = \sum_{k=m}^{n} \phi(n|k)\, g(k,m). \tag{5.3.59}$$

Lemma 5.3.6:

$$\sum_{n=1}^{\infty} \sum_{k=1}^{n} q(n,k) < \infty, \tag{5.3.60}$$

$$\lim_{k\to\infty} k \sum_{n=k}^{\infty} q(n,k) = 0. \tag{5.3.61}$$

Proof of Lemma 5.3.6: Using the definition of $\phi(n|m)$ and (5.3.46), we obtain
after some rearrangement

$$\sum_{n=1}^{\infty} \sum_{k=1}^{n} q(n,k) \le A \sum_{n=1}^{\infty} \sum_{m=1}^{n} \phi(n|m) < \infty, \tag{5.3.62}$$

where the last inequality follows from (5.3.33), (5.3.37). Similarly, using
(5.3.47)

$$k \sum_{n=k}^{\infty} q(n,k) = k \sum_{n=k}^{\infty} \sum_{m=k}^{n} \phi(n|m)\, g(m,k) = k \sum_{m=k}^{\infty} g(m,k) \sum_{n=m}^{\infty} \phi(n|m)$$

$$\le A k \sup_{m \ge k} \sum_{n=m}^{\infty} \phi(n|m) \le A \sup_{m \ge k} \Big\{ \frac{k}{m^2} + \sum_{\ell=1}^{m-1} \frac{k}{m\ell}\, c(m|\ell) + \sum_{n=m+1}^{\infty} \frac{k}{nm}\, c(n|m) \Big\}$$

$$\le \frac{A}{k} + A \sup_{m \ge k} \sum_{\ell=1}^{m-1} \frac{1}{\ell}\, c(m|\ell) + A \sup_{m \ge k} \sum_{n=m+1}^{\infty} \frac{1}{n}\, c(n|m). \tag{5.3.63}$$

The second and the third expression in the right hand side of (5.3.63) are
easily seen to converge to zero, as $k \to \infty$, because of (5.3.35) and (5.3.36),
respectively. This concludes the proof of Lemma 5.3.6. □
Let us define

$$R_n = A \sum_{m=1}^{n} q(n,m)\, E[G(m)] - \frac{1}{n}\, E[G(n)], \tag{5.3.64}$$

and note that (5.3.58) may be written as

$$E[J(y(n+1))] \le E[J(y(n))] + R_n + A \sum_{m=1}^{n} q(n,m)\, \big[E[J(y(m))] + 1\big]. \tag{5.3.65}$$

We also define $P_n$ by $P_1 = E[J(y(1))]$,

$$P_{n+1} = P_n + R_n + A \sum_{m=1}^{n} q(n,m)\, [P_m + 1], \tag{5.3.66}$$

and note that $P_n \ge E[J(y(n))]$, $\forall n$. By linearity, there exists a kernel $V(n,k)$
such that

$$P_{n+1} = \sum_{k=1}^{n} V(n,k+1) \Big[ R_k + A \sum_{m=1}^{k} q(k,m) \Big] + V(n,1)\, P_1. \tag{5.3.67}$$

Clearly, $V(n,k) \ge 1$, $\forall n,k$. Moreover, inequality (5.3.60) implies

$$\prod_{n=1}^{\infty} \Big[ 1 + \sum_{m=1}^{n} q(n,m) \Big] < \infty,$$

which leads to the conclusion that, for some constant $V^*$, $V(n,k) \le V^*$, $\forall n,k$.

Using Lemma 5.3.6,

$$\sum_{k=1}^{n} V(n,k+1) \sum_{m=1}^{k} q(k,m) \le V^* \sum_{k=1}^{\infty} \sum_{m=1}^{k} q(k,m) \le A < \infty. \tag{5.3.68}$$

Using the definition of $R_n$, we obtain

$$\sum_{k=1}^{n} V(n,k+1)\, R_k \le \sum_{k=1}^{n} \Big[ V^* A \sum_{m=k}^{n} q(m,k) - \frac{1}{k} \Big] E[G(k)]. \tag{5.3.69}$$

Lemma 5.3.6 implies that $V^* A \sum_{m=k}^{\infty} q(m,k) - \frac{1}{k}$ becomes negative for large enough $k$,
which shows that $\sum_{k=1}^{n} V(n,k+1)\, R_k$ is bounded above by some constant $A$ for all $n$.
So, we may conclude from (5.3.67) that the sequence $\{P_n\}$ is bounded. Hence, for
some $A > 0$,

$$E[J(y(n))] \le A, \quad \forall n. \tag{5.3.70}$$

From (5.3.67), (5.3.68) we conclude that

$$\liminf_{n\to\infty} \sum_{k=1}^{n} V(n,k+1)\, R_k > -\infty. \tag{5.3.71}$$

Using (5.3.69) and Lemma 5.3.6 again, we obtain

$$\sum_{k=1}^{\infty} \frac{1}{k}\, E[G(k)] < \infty. \tag{5.3.72}$$

Using the monotone convergence theorem, as in the proof of Theorem 5.3.1, the
proof of part (c) is completed.

We now define

$$Z(n) = \sum_{k=1}^{n} \phi(n|k)\, b^2(k). \tag{5.3.73}$$

Using Lemma 5.3.5 and similarly with (5.3.58), we obtain

$$\sum_{n=1}^{\infty} E[Z(n)] \le A \sum_{n=1}^{\infty} \sum_{m=1}^{n} q(n,m)\, \big[E[G(m)] + E[J(y(m))] + 1\big]. \tag{5.3.74}$$

Using Lemma 5.3.6, (5.3.70) and (5.3.72) we conclude that $\sum_{n=1}^{\infty} E[Z(n)] < \infty$.
Now recall inequality (5.3.38). Taking conditional expectations, we obtain

$$E\big[J(y(n+1)) \mid F_n^0\big] \le J(y(n)) + E\big[Z(n) \mid F_n^0\big], \tag{5.3.75}$$

and using Lemma 5.3.1, it follows that $J(y(n))$ converges almost surely.
We now turn to the proof of part (b) of the Theorem. Let

$$B_n = \max_i \max_{1 \le k \le n} E\big[\|s^i(k)\|^2\big], \tag{5.3.76}$$

$$D_n = \max_i E\big[\|x^i(n) - y(n)\|^2\big]. \tag{5.3.77}$$

Inequality (5.3.44) implies that

$$D_{n+1} \le \beta(n)\, B_n, \tag{5.3.78}$$

where $\beta(n)$ is a sequence which converges to zero, by (5.3.35). Moreover,
using (5.3.30),

$$E\big[\|s^i(n)\|^2\big] \le A\, E[J(x^i(n))] - E\big[(A \nabla J(x^i(n)))\, s^i(n)\big] + A$$

$$\le A\, E[J(x^i(n))] + 4A^2 E\big[\|\nabla J(x^i(n))\|^2\big] + \tfrac14 E\big[\|s^i(n)\|^2\big] + A \le A\, E[J(x^i(n))] + \tfrac12 E\big[\|s^i(n)\|^2\big] + A, \tag{5.3.79}$$

where the last inequality follows from Lemma 5.3.2. This finally implies that

$$E\big[\|s^i(n)\|^2\big] \le A\, E[J(x^i(n))] + A. \tag{5.3.80}$$

Taking expectations in (5.3.42) we have

$$E[J(x^i(n))] \le A\, E[J(y(n))] + A D_n. \tag{5.3.81}$$

Inequalities (5.3.80) and (5.3.81) imply

$$E\big[\|s^i(n)\|^2\big] \le A\, E[J(y(n))] + A D_n + A, \tag{5.3.82}$$

$$B_n \le A Q + A \max_{k \le n} D_k, \tag{5.3.83}$$

where $Q = 1 + \sup_{n \ge 1} E[J(y(n))]$. Combining (5.3.78) and (5.3.83), we have

$$D_{n+1} \le \beta(n)\, \big(A Q + A \max_{k \le n} D_k\big)$$

and since $\beta(n)$ converges to zero, so does $D_n$. This proves that $x^i(n) - y(n)$
converges to zero in the mean square.

As a corollary to the preceding, we obtain

$$\sup_n E[b^2(n)] < \infty. \tag{5.3.84}$$

Let

$$C_k = \sum_{k^{1/\delta} \le i < (k+1)^{1/\delta}} \frac{1}{i}\, b(i), \quad k \ge 1. \tag{5.3.85}$$

(5.3.86)

Using the fact that there exists an $A$ such that $(k+1)^{1/\delta} \le A k^{1/\delta}$, $\forall k$, we
obtain from (5.3.86)

$$\sum_{k=1}^{\infty} E[C_k^2] \le \sum_{k=1}^{\infty} \Big[ \sum_{k^{1/\delta} \le i < (k+1)^{1/\delta}} \frac{1}{i} \Big]^2 \sup_n E[b^2(n)] \le A \sum_{k=1}^{\infty} \frac{\big[(k+1)^{1/\delta} - k^{1/\delta}\big]^2}{k^{2/\delta}} \le A \sum_{k=1}^{\infty} \frac{k^{(2/\delta)-2}}{k^{2/\delta}} < \infty. \tag{5.3.87}$$

It follows that $C_k^2$ converges to zero, almost surely. Consequently, so does
$C_k$ and $\sum_{k=1}^{n} d^{\,n-k} C_k$ as well. Let $N$ denote the largest integer such that $N \le n^\delta$ and
use (5.3.43) to obtain
$$\|x^i(n) - y(n)\| \le A \sum_{k=1}^{n-1} \frac{1}{k}\, c(n|k)\, b(k) \le A \sum_{1 \le k \le n^\delta - 1}\ \sum_{k^{1/\delta} \le i < (k+1)^{1/\delta}} \frac{1}{i}\, d^{\,n^\delta - i^\delta}\, b(i)$$

$$\le A \sum_{1 \le k \le n^\delta - 1} d^{\,N-(k+1)} \sum_{k^{1/\delta} \le i < (k+1)^{1/\delta}} \frac{1}{i}\, b(i) \le A \sum_{k=1}^{N} d^{\,N-k}\, C_k. \tag{5.3.88}$$

As $n$ converges to infinity, so does $N$ and, by the above discussion, $x^i(n) - y(n)$
converges to zero, as $n \to \infty$. Consequently, $x^i(n) - x^j(n)$ also converges to zero,
for any $i,j$, completing the proof of part (b).

Finally, since $J(y(n))$ converges and $x^i(n) - y(n)$ converges to zero, part (a)
of the theorem follows, as in the proof of Theorem 5.3.1. □

We continue with a corollary which shows that, under reasonable conditions,
convergence to a stationary point or a global optimum may be guaranteed. We only
need to assume that away from stationary points some processor will make a positive
improvement in the cost function. Naturally, we only require the processors to
make positive improvements at times that they are not idle, that is at times $t \in T_\ell^i$.

We first need some technical background: an integer-valued random variable
$t$ is called a "stopping time" (with respect to $\{F_n\}$) if and only if the event
$\{t \le n\}$ belongs to $F_n$, for every $n$. Given a stopping time $t$, we define $F_t$ to be
the $\sigma$-field generated by those events $A \in F$ such that $A \cap \{t \le n\}$ belongs to $F_n$, for
every $n$. Intuitively, $F_t$ describes all events which have occurred up to the random
time $t$.

Corollary 5.3.1: Suppose that for some $K_4 > 0$, $\gamma^i(n) \ge K_4/n$, $\forall n,i$. Assume that $H$
is finite dimensional, $J$ has compact level sets and that there exist continuous
functions $g_\ell^i: H \to [0,\infty)$ such that

$$\nabla_\ell J(x^i(t))\, E\big[s_\ell^i(t) \mid F_t\big] \le -g_\ell^i(x^i(t)), \tag{5.3.89}$$

for every stopping time $t$ satisfying $t \in T_\ell^i$, with probability 1.

We define $g: H \to [0,\infty)$ by $g(x) = \sum_{i=1}^{M} \sum_{\ell=1}^{L} g_\ell^i(x)$ and we assume that any point
$x \in H$ satisfying $g(x)=0$ is a stationary point of $J$. Finally, suppose that the
difference between consecutive elements of $T_\ell^i$ is bounded, for any $i,\ell$ such that
$T_\ell^i \ne \emptyset$. Then,

a) Under the Assumptions of either Theorem 5.3.1 or 5.3.2,

$$\liminf_{n\to\infty} \|\nabla J(x^i(n))\| = 0, \quad \forall i, \ \text{a.s.} \tag{5.3.90}$$

b) Under the Assumptions of Theorem 5.3.1 and if (for some $\epsilon > 0$) $\gamma^i(n) \ge \epsilon$, $\forall i,n$,
we have

$$\lim_{n\to\infty} \|\nabla J(x^i(n))\| = 0, \quad \forall i, \ \text{a.s.} \tag{5.3.91}$$

and any limit point of $\{x^i(n)\}$ is a stationary point of $J$.

c) Under the Assumptions of either Theorem 5.3.1 or 5.3.2 and if every point
satisfying $g(x)=0$ is a minimizing point of $J$ (this is implicitly assuming that
all stationary points of $J$ are minima), then

$$\lim_{n\to\infty} J(x^i(n)) = \inf_{x \in H} J(x). \tag{5.3.92}$$

Proof of Corollary 5.3.1: From part (c) of either Theorem 5.3.1 or 5.3.2 and
(5.3.89) we obtain

$$\sum_{i=1}^{M} \sum_{\ell=1}^{L} \sum_{n \in T_\ell^i} \gamma^i(n)\, g_\ell^i(x^i(n)) < \infty, \quad \text{a.s.} \tag{5.3.93}$$

Because of our assumption on the sets $T_\ell^i$, it follows that there exists a positive
integer $c$ such that, for any $i,\ell,m$, the interval $\{cm+1, cm+2, \ldots, c(m+1)\}$ contains
at least one element of $T_\ell^i$. Let us choose sequences of such elements denoted
by $t_{\ell,m}^i$. By (5.3.93), we have

$$\sum_{i=1}^{M} \sum_{\ell=1}^{L} \sum_{m=1}^{\infty} \gamma^i(t_{\ell,m}^i)\, g_\ell^i\big(x^i(t_{\ell,m}^i)\big) < \infty, \quad \text{a.s.} \tag{5.3.94}$$

Now notice that, for some constant $K_5 > 0$,

$$\gamma^i(t_{\ell,m}^i) \ge \frac{K_4}{t_{\ell,m}^i} \ge \frac{K_4}{c(m+1)} \ge \frac{K_5}{m}, \quad \forall i,\ell,m. \tag{5.3.95}$$

Hence, (5.3.94) yields

$$\sum_{m=1}^{\infty} \frac{1}{m} \sum_{i=1}^{M} \sum_{\ell=1}^{L} g_\ell^i\big(x^i(t_{\ell,m}^i)\big) < \infty, \quad \text{a.s.} \tag{5.3.96}$$

From either Theorem 5.3.1 or 5.3.2 and its proof we obtain
$\lim_{n\to\infty} (x^i(n) - y(n)) = \lim_{n\to\infty} (y(n+1) - y(n)) = 0$ which implies that

$$\lim_{m\to\infty} \big(x^i(t_{\ell,m}^i) - y(t_{1,m}^1)\big) = 0, \quad \forall i,\ell. \tag{5.3.97}$$

Since $J$ has compact level sets and $J(y(n))$ converges, the sequence $\{y(n)\}$ is
bounded. We therefore need to consider the functions $g_\ell^i$ only on a compact set
on which they are uniformly continuous. Therefore,

$$\lim_{m\to\infty} \Big( g_\ell^i\big(x^i(t_{\ell,m}^i)\big) - g_\ell^i\big(y(t_{1,m}^1)\big) \Big) = 0, \quad \forall i,\ell. \tag{5.3.98}$$

Clearly, we can choose these sequences so that each $t_{\ell,m}^i$ is a stopping time.


By combining (5.3.96) and (5.3.98) we obtain

$$\liminf_{m\to\infty} \sum_{i=1}^{M} \sum_{\ell=1}^{L} g_\ell^i\big(y(t_{1,m}^1)\big) = 0. \tag{5.3.99}$$

a) By (5.3.99), there must be some subsequence of $\{t_{1,m}^1\}$ along which
$g(y(t_{1,m}^1))$ converges to zero. Let $y^*$ be a limit point of the corresponding
subsequence of $\{y(t_{1,m}^1)\}$. By continuity, $g(y^*)=0$ and by assumption, $y^*$ must be
a stationary point of $J$, so $\nabla J(y^*)=0$. Moreover, $x^i(t_{1,m}^1)$ also converges to $y^*$
along the same subsequence. By continuity of $\nabla J$, (5.3.90) follows.

b) In this case, (5.3.94) implies

$$\lim_{m\to\infty} \sum_{i=1}^{M} \sum_{\ell=1}^{L} g_\ell^i\big(x^i(t_{\ell,m}^i)\big) = 0, \quad \text{a.s.} \tag{5.3.100}$$

and the rest of the proof is the same as for part (a), except that we do not
need to restrict ourselves to a convergent subsequence.

c) From part (a) we conclude that some subsequence of $\{y(t_{1,m}^1)\}$ converges to
some $y^*$ for which $g(y^*)=0$. Consequently, $y^*$ minimizes $J$. Using the continuity
of $J$,

$$\liminf_{n\to\infty} J(y(n)) \le \liminf_{m\to\infty} J\big(y(t_{1,m}^1)\big) \le J(y^*) = \inf_{x \in H} J(x).$$

On the other hand, $J(y(n))$ converges (part (a) of either Theorem 5.3.1 or 5.3.2)
which shows that (5.3.92) holds. □

We now discuss the above corollary and apply it to the examples of
Section 5.2. Notice first that the assumption on $T_\ell^i$ states that, for each
component $\ell$, the time between successive computations of $s_\ell^i$ is bounded, for any
computing processor $i$ for that component. Such a condition will always be met in
practice. The assumption $\gamma^i(n) \ge K_4/n$ may be enforced without the processors
having access to a global clock. For example, apart from the trivial case of
constant step-size, we may let $\gamma^i(n) = 1/t_n^i$, where $t_n^i$ is the number of times,
before time $n$, that processor $i$ has performed a computation.
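A minimal sketch of such a local step-size rule follows. The function name `local_step_sizes`, the representation of a processor's activity as a list of computation times, and the guard for a processor that has not yet computed anything are assumptions made here for illustration.

```python
def local_step_sizes(compute_times, horizon):
    """Return [gamma(1), ..., gamma(horizon)] with gamma(n) = 1 / t_n, where t_n
    counts the computations the processor has performed strictly before time n."""
    times = set(compute_times)
    gammas = []
    count = 0
    for n in range(1, horizon + 1):
        # max(count, 1) is our own guard for the case t_n = 0.
        gammas.append(1.0 / max(count, 1))
        if n in times:
            count += 1
    return gammas

# A processor that computes once every three time units.
gammas = local_step_sizes(range(1, 301, 3), 300)
```

Since a processor cannot have computed more than $n-1$ times before time $n$, this rule gives $\gamma(n) \ge 1/n$ without reference to a global clock. When computations occur at bounded intervals, as in the usage above, the step-size is also bounded above by a constant multiple of $1/n$.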

For Examples I and III, (5.3.89) holds with $g_\ell^i(x)$ a constant multiple of
$\|\nabla_\ell J(x)\|^2$; for Examples II and IV, it holds with $g_\ell^i(x)$ a constant multiple
of $\|\nabla J(x)\|^2$. We may conclude that Corollary 5.3.1 applies and proves convergence
for Examples I-IV.

5.4  DECENTRALIZED  STOCHASTIC  ALGORITHMS  WITH  CORRELATED  NOISE 

We  have  considered  thus  far  stochastic  algorithms  with  decreasing  step-size, 
under  the  crucial  descent  Assumption  5.3.3.  However,  there  is  a  large  number  of 
applications  of  stochastic  algorithms  in  which  such  an  assumption  is  violated  [Ljung, 
1977a].  Many  such  applications  are  related  to  identification  of  ARMAX  models, 
or  adaptive  control  of  stochastic  systems  [Ljung,  1977b] .  Martingale  theory  is 
not  directly  applicable  to  the  study  of  such  algorithms,  but  results  have  been 
obtained  via  other  approaches.  Ljung  [1977a]  has  shown  that  convergence  may  be 
studied  in  terms  of  the  stability  properties  of  an  autonomous,  deterministic, 
ordinary differential equation (ODE). (Hence, this method of analysis has been
called the "ODE approach".) The proof exploits the fact that as the step-size
becomes smaller, the algorithm evolves in a progressively slower time scale,
while the dynamics of the underlying uncertainties evolve in a constant time
scale. Hence, we have, asymptotically, a separation of time scales; the stochastic
effects may be averaged out and we are left with a deterministic ordinary
differential equation. Similar results have been obtained by Kushner and Clark
[1978] and have been further developed and unified by Metivier and Priouret [1984].

The  main  deficiency  of  the  ODE  approach  is  that  convergence  may  be  proved 
only  under  the  assumption  that  the  algorithm  returns  infinitely  often  to  a 
bounded region. This assumption needs to be verified by means of different
approaches, or it may have to be enforced directly by modifying the algorithm, e.g.
by  projecting  the  relevant  variables  back  into  a  bounded  region,  whenever  the 
algorithm  moves  away  from  this  region.  This  may  create  new  problems  because  the 
bounded  region  into  which  we  project  should  contain  the  desired  point  of  convergence 
and,  typically,  this  point  is  itself  unknown. 

In  Section  5.3  we  have  shown  that  the  natural  decentralized  versions  of 
descent-type  stochastic  algorithms  have  essentially  the  same  convergence  properties 
as  their  centralized  counterparts.  We  will  show  below  that  the  same  is  true, 
without  the  descent  assumption.  In  particular,  we  will  show  that  the  main  conclusions 
of the ODE approach still hold (appropriately modified). Of course, since a
"returning" assumption is essential for proving convergence of centralized algorithms,
such a condition has to be assumed for decentralized algorithms as well.

We  adopt  again  the  model  of  computation  (and  the  corresponding  notation)  of 
Section  5.2.  We  now  introduce  some  assumptions,  which  will  be  discussed  later. 

In particular, we assume the following:


(i)  H  is  a  finite  dimensional  Euclidean  space. 

(ii) $\lim_{n\to\infty} \Phi^{ij}(n|k)$ exists for all i, j, k and is independent of i.
(See Lemma 5.2.1 for some sufficient conditions for this
assumption to hold.)

(iii) Communication delays are zero. For future reference, we collect
here the relevant equations from Section 5.2. Namely, for any
i and $n \ge m \ge 1$ we have:

\[ x^i(n+1) = \sum_{j=1}^{M} \Phi^{ij}(n+1|n-1)\, x^j(n) + \gamma^i(n)\, s^i(n), \]  (5.4.1)

\[ x^i(n+1) = \sum_{j=1}^{M} \Big[ \Phi^{ij}(n+1|m-1)\, x^j(m) + \sum_{k=m}^{n} \Phi^{ij}(n+1|k)\, \gamma^j(k)\, s^j(k) \Big], \]  (5.4.2)

\[ y(n) = \sum_{j=1}^{M} \Phi^j(n-1)\, x^j(n), \]  (5.4.3)

\[ y(n+1) = y(n) + \sum_{j=1}^{M} \gamma^j(n)\, \Phi^j(n)\, s^j(n). \]  (5.4.4)


Also, 


Assumption 5.4.1: $\gamma^i(n) = 1/n$, $\forall n, i$.

Assumption 5.4.2: There exists a stochastic process $\{\phi(n)\}$, taking values in
a Euclidean space S, defined on some probability space $(\Omega, F, P)$, and a set of
functions $Q^i : \mathbb{N} \times H \times S \to H$, i=1,...,M, such that

\[ s^i(n) = Q^i(n, x^i(n), \phi(n)), \quad \forall i, n. \]  (5.4.5)

Moreover, there exist functions $K_1 : S \times R \to R$, $K_2 : H \to R$ and $K_3 : R \to R$ such that

\[ \|Q^i(n,x,\phi) - Q^i(n,x',\phi)\| \le K_1(\phi,C)\, \|x - x'\|, \quad \forall n, \phi, i. \]  (5.4.6)


For any A>0, let m(n,A) be a sequence of integers such that

\[ \sum_{i=n}^{m(n,A)-1} \frac{1}{i} \le A < \sum_{i=n}^{m(n,A)} \frac{1}{i} \]  (5.4.16)

and note that for large n,

\[ A \approx \ln \frac{m(n,A)}{n}, \]  (5.4.17)

which implies that m(n,A) is approximately equal to $n e^{A}$.
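As a rough numerical illustration of this approximation, the sketch below computes an interval end point and compares it with n e^A. It assumes one reading of (5.4.16), namely that m(n,A) is the first index at which the harmonic sum started at n exceeds A; the function name `interval_end` is hypothetical.

```python
import math


def interval_end(n, A):
    """First index m >= n at which sum_{i=n}^{m} 1/i exceeds A.

    Assumption: this is only one possible reading of the definition of
    m(n, A); the exact inequalities in the source are hard to recover.
    """
    total = 0.0
    i = n
    while True:
        total += 1.0 / i
        if total > A:
            return i
        i += 1


def relative_gap(n, A):
    """Relative difference between interval_end(n, A) and n * exp(A)."""
    return interval_end(n, A) / (n * math.exp(A)) - 1.0
```

For fixed A, `relative_gap(n, A)` should shrink as n grows, which is the sense in which m(n,A) behaves like n e^A.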


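Assumption 5.4.5: There exists a twice continuously differentiable function
$V : H \to [0,\infty)$ with a unique global minimum $x^*$ satisfying

\[ \frac{\partial V}{\partial x}(x) \ne 0, \quad x \ne x^*, \]  (5.4.18)

\[ V(x^*) = 0, \qquad \lim_{\|x\|\to\infty} V(x) = \infty, \]  (5.4.19)

and such that for some $A_1>0$, and for any A>0, $x \in H$,

\[ \limsup_{n\to\infty} \sum_{k=n}^{m(n,A)} \frac{1}{k}\, f(k,x) \cdot \frac{\partial V}{\partial x}(x) \le -A_1 A \Big\| \frac{\partial V}{\partial x}(x) \Big\|^2. \]  (5.4.20)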


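Assumption 5.4.6: For some $A_2>0$ and for any A>0, $x \in H$,

\[ \limsup_{n\to\infty} \max_{n \le k \le m(n,A)} \Big\| \sum_{t=n}^{k} \frac{1}{t}\, f(t,x) \Big\| \le A_2 A \Big\| \frac{\partial V}{\partial x}(x) \Big\|, \]  (5.4.21)

\[ \limsup_{n\to\infty} \max_{n \le k \le m(n,A)} \max_{1 \le i \le M} \Big\| \sum_{t=n}^{k} \frac{1}{t} \sum_{j=1}^{M} \Phi^{ij}(k|t)\, f^j(t,x) \Big\| \le A_2 A \Big\| \frac{\partial V}{\partial x}(x) \Big\|. \]  (5.4.22)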


'0 


lim  sup  max  ||$1:,(k|n)  -  (n)  |  |  <|—  ,  Vi,j  .  (5.4.23) 

n-H»  k>m(n,A  ) 

—  o 

Remarks  and  Interpretation  of  the  Assumptions 

1. The assumption that communication delays are zero is not essential for
the result to be proved. Similar results may be obtained under the assumption
that communication delays are bounded, provided that the returning condition
(see the statement of Theorem 5.4.1) is slightly strengthened to require that at
the times $n_k$ the magnitude of the state $x^i(n)$ of each processor, as well as of any
message that has been transmitted but not yet received, is bounded by the constant C.

2. Assumption 5.4.1 may be generalized to allow for a larger class of
sequences $\{\gamma(k)\}$ satisfying $\sum_{k=1}^{\infty} \gamma(k) = \infty$, $\sum_{k=1}^{\infty} \gamma^p(k) < \infty$, for some p>1 (see Ljung
[1977a]), including choices where processors need not have access to a global
clock. Nevertheless, the choice {1/k} is representative enough and simplifies
notation.

3. Assumption 5.4.2 states that the updates $s^i(n)$ are a function of the
current state vector $x^i(n)$ and an underlying random process $\{\phi(n)\}$, not
necessarily white. Inequalities (5.4.6), (5.4.7), (5.4.8) may be easily checked
if $Q^i$ is known. They effectively require that $Q^i$ is twice continuously differentiable
and that the derivatives (viewed as functions of $\phi$) increase slower than
some polynomial.

4. Assumption 5.4.3 defines a model for the underlying stochastic process
$\{\phi(n)\}$. While this structure is convenient, it is far from necessary. In fact
any non-linear model for $\{\phi(n)\}$ would still lead to the same results, provided
that the dependence of $\phi(n)$ on $\phi(k)$ decreases "fast enough", as n-k increases.
Assumptions 5.4.3 and 5.4.4 are only used to prove the "averaging" Lemma 5.4.3.
Consequently, our results remain valid whenever the conclusions of Lemma 5.4.3
can be shown to be true, possibly using different means.

5. The main difference of our assumptions from those of Ljung [1977a] is
that we do not allow any feedback from the algorithm to influence the evolution
of the process $\{\phi(n)\}$. For this reason, the Theorem that follows is not
applicable to algorithms for adaptive control, nor to certain identification
algorithms. More general models, allowing feedback, are considered later.

6. Assumption 5.4.5 guarantees that the direction of updates is a descent
direction with respect to the function V. A simpler version of Assumption
5.4.5 would be to require

\[ f(k,x) \cdot \frac{\partial V}{\partial x}(x) \le -A_1 \Big\| \frac{\partial V}{\partial x}(x) \Big\|^2, \quad \forall k, x, \]  (5.4.24)

or the even stronger version

\[ f^i(k,x) \cdot \frac{\partial V}{\partial x}(x) \le -A_1 \Big\| \frac{\partial V}{\partial x}(x) \Big\|^2, \quad \forall k, x, i. \]  (5.4.25)

Our version (5.4.20) has the following advantages over (5.4.24) or (5.4.25):

a) We do not require the direction $f^i(k,x)$ of update of each processor to
be a descent direction. Conditions are only placed on $f(k,x) = \sum_{i=1}^{M} \Phi^i(k) f^i(k,x)$,
which is in a sense the total direction in which the processors (viewed as a whole)
update.

b) Our condition is on $\sum_{k=n}^{m(n,A)} \frac{1}{k} f(k,x)$, which is an average direction of
update over some time interval, rather than on f(k,x), which is the direction of
update at a single time instance. In terms of applications, Assumption 5.4.5
has the advantage that it allows f(k,x) to be zero at some times. Thus, we do
not require the processors to obtain measurements or perform computations at
each time instance. Rather, we put a constraint on the average amount of computations
they will perform over some time interval.
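The point of item (b) can be illustrated with a toy scalar example, given here only as a sketch under assumptions that are not in the text: V(x) = x^2/2, an update direction equal to -2x at even times and zero at odd times, and an interval that ends at the first index where the harmonic sum from n exceeds A. The function name `averaged_descent` is hypothetical.

```python
def averaged_descent(n, A, x):
    """Weighted sum of f(k, x) * V'(x) / k over one averaging interval.

    Toy assumptions: V(x) = x**2 / 2, so V'(x) = x; the processors are
    idle (f = 0) at odd times and use f = -2x at even times.
    """
    grad = x
    harmonic = 0.0   # running value of sum 1/k, used to end the interval
    total = 0.0
    k = n
    while harmonic <= A:
        harmonic += 1.0 / k
        f = -2.0 * x if k % 2 == 0 else 0.0
        total += f * grad / k
        k += 1
    return total
```

At every odd time the pointwise product f(k,x) V'(x) is zero, so a pointwise requirement such as (5.4.24) cannot hold for x different from zero, whereas the averaged quantity returned above is close to -A x^2 for large n, in the spirit of the averaged condition.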

7. Assumption 5.4.5 is much weaker than the pseudogradient Assumption
5.3.3. This is because the inequality $f^i(k,x) \cdot \frac{\partial V}{\partial x}(x) \le 0$, $\forall x$, does not
imply, in general,

\[ E\big[ Q^i(n, x^i(n), \phi(n)) \,\big|\, \phi(n-1), x^i(n) \big] \cdot \frac{\partial V}{\partial x}(x^i(n)) \le 0 \]

when the process $\{\phi(n)\}$ is non-white.

8. Inequality (5.4.20) also implies (using the Schwarz inequality)
that

\[ \liminf_{n\to\infty} \Big\| \sum_{k=n}^{m(n,A)} \frac{1}{k}\, f(k,x) \Big\| \ge A_1 A \Big\| \frac{\partial V}{\partial x}(x) \Big\|. \]  (5.4.26)

9. Assumption 5.4.6 requires that the magnitude of the expected updates
is not too big, compared with $\|\frac{\partial V}{\partial x}(x)\|$. It is satisfied, in particular, if,
for some A,

\[ \|f^i(t,x)\| \le A \Big\| \frac{\partial V}{\partial x}(x) \Big\|, \quad \forall x, t, i, \]  (5.4.27)

but also covers a few more general cases (see Remark 6).


10. Note that if $s^i(n)$ is bounded by some constant A, for each i, n, the
vector $x^i$ will move by an amount of the order of $A_0$ during the time interval
$[n, m(n,A_0)]$. So, this time interval may be viewed as unit time in a time scale
naturally associated with the algorithm. Assumption 5.4.7 implies that the
disagreement between processors tends to decrease by a factor of d<1 during such
a time unit. The above statement can be made precise as follows:

Let $D(n) = \max_{i,j} \|x^i(n) - x^j(n)\|$ and suppose that $s^i(k)=0$, $\forall k>n$, $\forall i$. Then,
Assumption 5.4.7 implies that, for n large enough, $D(m(n,A_0)) \le d \cdot D(n+1)$. (A
proof may be easily obtained from (5.4.23) and Lemma 5.2.2 (vi).)

Natural conditions that guarantee that Assumption 5.4.7 holds may be
easily obtained, as in Lemma 5.2.1. Let us just point out here that for the
specialization case, Assumption 5.4.7 is satisfied if every processor communicates
to every other processor once during the time interval $[n, m(n,A_0)]$. Since
$m(n,A_0) \approx n e^{A_0}$, this allows the time between consecutive communications to increase
exponentially. So, for the time being, we allow the communication process to be
even slower than that assumed for the purposes of Theorem 5.3.2.

We  now  proceed  to  the  first  result  of  this  section: 


Theorem 5.4.1: Let $C<\infty$. Let Assumptions 5.4.1-5.4.7 hold. Then, there exist
constants $A^*(C)>0$, $d^*(C) \in [0,1)$ such that: if the constants $A_0$, d of Assumption
5.4.7 satisfy $A_0 < A^*(C)$, $d \le d^*(C)$, then the following is true:

If for almost all $\omega \in \Omega$ there exists a subsequence $\{n_k\}$ of the positive integers
such that

\[ \|x^i(n_k)\| \le C, \quad \forall i, k, \]  (5.4.28)

and if the point $x^*$ minimizing V satisfies

\[ \|x^*\| < C, \]  (5.4.29)

then

\[ \lim_{n\to\infty} x^i(n) = x^*, \quad \forall i, \text{ almost surely.} \]  (5.4.30)

Remark: Later in this section we will indicate differences and similarities
with the results of Ljung, and how an ordinary differential equation might enter
the picture.

Notation: For convenience, we present here a list of some notation to be used
in the course of the proof. In particular, given some $n \in \mathbb{N}$, $\bar y \in H$,
A>0, C>0, we let:

\[ C(n) = \max\big\{ \|\bar y\|, \max_i \|x^i(n)\| \big\}, \]  (5.4.31)

\[ D(n) = \max_{i,j} \|x^i(n) - x^j(n)\|, \]  (5.4.32)

\[ P(n) = \|y(n) - \bar y\|, \]  (5.4.33)

\[ q(n,A,C) = \sum_{k=n}^{m(n,A)} \frac{1}{k} K_1(\phi(k), C), \]  (5.4.34)

\[ \varepsilon(n,A,\bar y) = \Big\| \sum_{k=n}^{m(n,A)} \frac{1}{k} \sum_{i=1}^{M} \Phi^i(k) \big[ Q^i(k,\bar y,\phi(k)) - f^i(k,\bar y) \big] \Big\|, \]  (5.4.35)

\[ e_1(n,A,\bar y) = \max_{n \le k \le m(n,A)} \Big\| \sum_{t=n}^{k} \frac{1}{t} \sum_{i=1}^{M} \Phi^i(t) \big[ Q^i(t,\bar y,\phi(t)) - f^i(t,\bar y) \big] \Big\|, \]  (5.4.36)

\[ e_2(n,A,\bar y) = \max_i \max_{n \le k \le m(n,A)} \Big\| \sum_{t=n}^{k} \frac{1}{t} \sum_{j=1}^{M} \Phi^{ij}(k|t) \big[ Q^j(t,\bar y,\phi(t)) - f^j(t,\bar y) \big] \Big\|, \]  (5.4.37)

\[ F(n,A,\bar y) = \sum_{k=n}^{m(n,A)} \frac{1}{k} f(k,\bar y), \]  (5.4.38)

\[ F_1(n,A,\bar y) = \max_{n \le k \le m(n,A)} \Big\| \sum_{t=n}^{k} \frac{1}{t} f(t,\bar y) \Big\|, \]  (5.4.39)

\[ F_2(n,A,\bar y) = \max_i \max_{n \le k \le m(n,A)} \Big\| \sum_{t=n}^{k} \frac{1}{t} \sum_{j=1}^{M} \Phi^{ij}(k|t) f^j(t,\bar y) \Big\|, \]  (5.4.40)

\[ e(n,A,\bar y) = e_1(n,A,\bar y) + e_2(n,A,\bar y), \]  (5.4.41)

\[ G(n,A,\bar y) = e(n,A,\bar y) + F_1(n,A,\bar y) + F_2(n,A,\bar y). \]  (5.4.42)

We also recall for future reference some elementary inequalities,
most of them consequences of Lemma 5.2.2:

\[ \|x^i(n) - y(n)\| \le D(n), \]  (5.4.43)

\[ \Big\| \sum_{j=1}^{M} \Phi^{ij}(k|n-1)\, x^j(n) \Big\| \le C(n), \]  (5.4.44)

\[ \|y(n)\| \le C(n), \]  (5.4.45)

\[ \Big\| \sum_{j=1}^{M} \big[ \Phi^{i_1 j}(k|n) - \Phi^{i_2 j}(k|n) \big] a^j \Big\| \le \max_{i_1,i_2} \|a^{i_1} - a^{i_2}\| \le 2 \max_i \|a^i\|, \]  (5.4.46)

\[ \Big\| \sum_{j=1}^{M} \big[ \Phi^{i_1 j}(k|n) - \Phi^{i_2 j}(k|n) \big] a^j \Big\| \le 2M\, c(k|n) \max_{i_1,i_2} \|a^{i_1} - a^{i_2}\|, \]  (5.4.47)

where

\[ c(k|n) = \max_{i,j} \|\Phi^{ij}(k|n) - \Phi^j(n)\|. \]  (5.4.48)

The proof of Theorem 5.4.1 proceeds through a sequence of Lemmas.
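The first elementary inequality on differences of weighted combinations can be checked numerically. The sketch below is only an illustration and rests on an assumption that is not restated in this excerpt: that for fixed k, n the weights Phi^{ij}(k|n) are nonnegative and sum to one over j (a row-stochastic matrix). Scalars are used in place of vectors, and the helper names are hypothetical.

```python
import random


def random_stochastic_matrix(M, rng):
    """Rows of nonnegative weights that sum to one (assumed weight structure)."""
    rows = []
    for _ in range(M):
        w = [rng.random() for _ in range(M)]
        s = sum(w)
        rows.append([v / s for v in w])
    return rows


def spread_after_mixing(Phi, a):
    """Largest difference between two rows' weighted combinations of a."""
    mixed = [sum(p * v for p, v in zip(row, a)) for row in Phi]
    return max(mixed) - min(mixed)


def spread(a):
    """Largest pairwise difference among the entries of a."""
    return max(a) - min(a)
```

Under the stochasticity assumption, each weighted combination is a convex combination of the entries of `a`, so `spread_after_mixing(Phi, a)` cannot exceed `spread(a)`, which in turn is at most twice the largest absolute entry.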
Lemma 5.4.1: Let {R(k)} be a sequence of zero-mean random variables such
that $E[R(k)R(j)] \le A\lambda^{|k-j|}$, for some A>0, $\lambda \in [0,1)$. Then for any A>0,

\[ \lim_{n\to\infty} \max_{n \le k \le m(n,A)} \Big| \sum_{\ell=n}^{k} \frac{1}{\ell} R(\ell) \Big| = 0, \quad \text{almost surely.} \]  (5.4.49)


Proof: The proof is essentially given by Ljung [1977a]: An ergodic theorem
of Cramer and Leadbetter implies that

\[ \lim_{n\to\infty} \frac{1}{n} \sum_{k=1}^{n} R(k) = 0, \quad \text{almost surely.} \]  (5.4.50)

Let

\[ z(n) = \frac{1}{n} \sum_{k=1}^{n} R(k). \]  (5.4.51)

Then,

\[ z(n) = z(n-1) + \frac{1}{n} \big( R(n) - z(n-1) \big) \]  (5.4.52)

and, therefore,

\[ \max_{n \le k \le m(n,A)} \Big| \sum_{\ell=n}^{k} \frac{1}{\ell} R(\ell) \Big| = \max_{n \le k \le m(n,A)} \Big| z(k) - z(n-1) + \sum_{\ell=n}^{k} \frac{1}{\ell} z(\ell-1) \Big| \le (2+A) \sup_{k \ge n-1} |z(k)|, \]  (5.4.53)

which converges to zero by virtue of (5.4.50). □
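The summation identity behind the last display follows from the recursion for z(n) and can be checked numerically. The sketch below is only an illustration; the function name `check_identity` is hypothetical, and the convention z(0) = 0 is an assumption needed only when the sum starts at index 1.

```python
import random


def check_identity(R, n, k):
    """Return both sides of
    sum_{l=n}^{k} R(l)/l = z(k) - z(n-1) + sum_{l=n}^{k} z(l-1)/l,
    where z(l) is the running mean of R(1..l) and z(0) = 0.

    R is a list whose entry R[0] is unused, so that R[l] matches the
    1-based index l used in the text.
    """
    z = [0.0] * len(R)
    total = 0.0
    for l in range(1, len(R)):
        total += R[l]
        z[l] = total / l
    lhs = sum(R[l] / l for l in range(n, k + 1))
    rhs = z[k] - z[n - 1] + sum(z[l - 1] / l for l in range(n, k + 1))
    return lhs, rhs
```

Once the identity holds, the bound in the proof only needs the sum of 1/l over the interval to be about A and the running means z(k) to be uniformly small for large k.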
m(n,A) < an


(5.4.59) 


m 
Therefore,


and 


E[w4(k|n)]< 

n 


®  m(n,A) 


,  n<k<m(n,A) 


l  l  E[w  (k|n) ]<  l 


24A 


(m(n 


n=l  k=n 


n=l  n 


■A)-"'(rr) 


\ 


<00 


(5.4.60) 


(5.4.61) 


Therefore, 


\[ \lim_{n\to\infty} \sum_{k=n}^{m(n,A)} w^4(k|n) = 0, \quad \text{almost surely,} \]  (5.4.62)

which implies (5.4.57). □

Lemma 5.4.3: Given C>0, there exist constants A', A(C) such that for any
$\bar y \in H$, $\|\bar y\| \le C$, and for any A>0, the following are true, almost surely.

a) \[ \lim_{n\to\infty} \varepsilon(n,A,\bar y) = 0, \]  (5.4.63)

b) \[ \limsup_{n\to\infty} q(n,A,C) \le A(C) A, \]  (5.4.64)

c) \[ \lim_{n\to\infty} e_1(n,A,\bar y) = 0, \]  (5.4.65)

d) \[ \lim_{n\to\infty} e_2(n,A,\bar y) = 0, \]  (5.4.66)

e) \[ \limsup_{n\to\infty} G(n,A,\bar y) \le A' \Big\| \frac{\partial V}{\partial y}(\bar y) \Big\| A. \]  (5.4.67)


Proof: a) Note that $0 \le \varepsilon(n,A,\bar y) \le e_1(n,A,\bar y)$ and for (5.4.63) it is sufficient
to prove (5.4.65).

b) By (5.4.34),

\[ q(n,A,C) = \sum_{k=n}^{m(n,A)} \frac{1}{k} E[K_1(\phi(k),C)] + \sum_{k=n}^{m(n,A)} \frac{1}{k} \big[ K_1(\phi(k),C) - E[K_1(\phi(k),C)] \big]. \]  (5.4.68)

By virtue of inequalities (5.4.8) and (5.4.13), it follows that the first sum
is bounded above by A(C)A for some A(C)>0. Concerning the second sum, we use
linearity (Assumption 5.4.3) to write, for k>t,

\[ \phi(k) = \psi(k|t)\phi(t) + \tilde\phi(k|t), \]  (5.4.69)

where $\tilde\phi(k|t)$ is independent of $\phi(t)$. Also, (5.4.11) implies

\[ \|\psi(k|t)\| \le c\lambda^{k-t} \]  (5.4.70)

for some c>0, $\lambda \in (0,1)$. Then

\[ \big| Cov[K_1(\phi(k),C), K_1(\phi(t),C)] \big| \le \big| Cov[K_1(\tilde\phi(k|t),C), K_1(\phi(t),C)] \big| + \big| Cov[\{K_1(\tilde\phi(k|t) + \psi(k|t)\phi(t),C) - K_1(\tilde\phi(k|t),C)\}, K_1(\phi(t),C)] \big|. \]  (5.4.71)

The first term in the right-hand side of (5.4.71) is zero because $\tilde\phi(k|t)$ is
independent of $\phi(t)$. Concerning the second term we use the Schwarz
inequality and (5.4.8) to bound it by

\[ E\big[ |K_1(\tilde\phi(k|t) + \psi(k|t)\phi(t),C) - K_1(\tilde\phi(k|t),C)|^2 \big]^{1/2} E\big[ |K_1(\phi(t),C)|^2 \big]^{1/2} \le E\big[ (\|\phi(k)\|^N + \|\tilde\phi(k|t)\|^N)^2 K_3(C)^2 \|\psi(k|t)\|^2 \|\phi(t)\|^2 \big]^{1/2} \cdot E\big[ |K_1(\phi(t),C)|^2 \big]^{1/2}. \]  (5.4.72)

Using (5.4.13) and inequality (5.4.70), we conclude that there is some new
constant $A(C) \ge 0$ and some $\lambda \in [0,1)$ such that

\[ E\big[ \{K_1(\phi(k),C) - E[K_1(\phi(k),C)]\} \cdot \{K_1(\phi(t),C) - E[K_1(\phi(t),C)]\} \big] \le A(C)\lambda^{|k-t|}, \quad \forall k, t. \]  (5.4.73)

Using Lemma 5.4.1, we can see that the last sum in (5.4.68) converges to zero,
thus proving (5.4.64).

c) The proof is identical to that of part (b). Instead of (5.4.8), we
use (5.4.7) to conclude that

\[ \big\| Cov[Q^i(k,\bar y,\phi(k)), Q^i(t,\bar y,\phi(t))] \big\| \le A\lambda^{|k-t|} \]  (5.4.74)

for some A, depending on $\bar y$ only; we then use Lemma 5.4.1.

d) Fix some i, j, $\ell$ and consider the double sequence of scalar random
variables $R_\ell(k|t)$ defined by

\[ R_\ell(k|t) = \Phi^{ij}(k|t) \big[ Q^j_\ell(t,\bar y,\phi(t)) - f^j_\ell(t,\bar y) \big]. \]  (5.4.75)

(The subscript $\ell$ indicates that we are dealing with the $\ell$-th component.)
We intend to apply Lemma 5.4.2; we therefore need to show that condition (5.4.54)
is satisfied. Without loss of generality, we assume that $Q^j_\ell(t,\bar y,\phi(t))$ is
zero-mean, i.e. $f^j_\ell(t,\bar y)=0$, $\forall t$. For the purposes of the current argument
j, $\ell$, $\bar y$ are held fixed; so, for the proof of this Lemma only, we will use the
short-hand notation $Q(t,\phi(t))$ instead of $Q^j_\ell(t,\bar y,\phi(t))$. Finally, note that
$\Phi^{ij}(k|t)$ is bounded above by 1. Hence, to verify (5.4.57) we need to show that,
for some A>0, $\lambda \in (0,1)$ and for all $t_1 \le t_2 \le t_3 \le t_4$,

\[ E[Q(t_1,\phi(t_1)) Q(t_2,\phi(t_2)) Q(t_3,\phi(t_3)) Q(t_4,\phi(t_4))] \le A\lambda^{t_4-t_1}. \]  (5.4.76)

By Assumption 5.4.2,

\[ |Q(t_4,\phi) - Q(t_4,\phi')| \le P_1(\|\phi\|, \|\phi'\|)\, \|\phi - \phi'\|, \quad \forall \phi, \phi', \]  (5.4.77)

where $P_1$ is some polynomial independent of $t_4$. Let

\[ r_4(\phi) = E[Q(t_4,\phi(t_4)) \mid \phi(t_3) = \phi]. \]  (5.4.78)

Then, using the notation introduced by (5.4.69) and (5.4.78),

\[ |r_4(\phi) - r_4(\phi')| = \big| E[Q(t_4, \psi(t_4|t_3)\phi + \tilde\phi(t_4|t_3)) - Q(t_4, \psi(t_4|t_3)\phi' + \tilde\phi(t_4|t_3))] \big| \le E\big[ P_1(\|\psi(t_4|t_3)\phi + \tilde\phi(t_4|t_3)\|, \|\psi(t_4|t_3)\phi' + \tilde\phi(t_4|t_3)\|) \big] \cdot \|\psi(t_4|t_3)\| \cdot \|\phi - \phi'\|. \]  (5.4.79)

(The expectations in the above inequality are with respect to $\tilde\phi(t_4|t_3)$.) Note
that $\|\psi(t_4|t_3)\| \le c\lambda^{t_4-t_3}$, where c, $\lambda$ are the constants of Assumption 5.4.3.
Moreover, using (5.4.13), it is easy to see that there exists a polynomial
$P_2(\cdot,\cdot)$, which does not depend on $t_3$, $t_4$, such that

\[ E\big[ P_1(\|\psi(t_4|t_3)\phi + \tilde\phi(t_4|t_3)\|, \|\psi(t_4|t_3)\phi' + \tilde\phi(t_4|t_3)\|) \big] \le P_2(\|\phi\|, \|\phi'\|). \]  (5.4.80)

Therefore,

\[ |r_4(\phi) - r_4(\phi')| \le c P_2(\|\phi\|, \|\phi'\|)\, \lambda^{t_4-t_3}\, \|\phi - \phi'\|. \]  (5.4.81)

Using (5.4.77), with $t_3$ in the place of $t_4$ and (5.4.81), and defining

\[ p_3(\phi) = Q(t_3,\phi)\, r_4(\phi), \]  (5.4.82)

it is easy to see that

\[ |p_3(\phi) - p_3(\phi')| \le P_3(\|\phi\|, \|\phi'\|)\, \lambda^{t_4-t_3}, \quad \forall \phi, \phi', \]  (5.4.83)

for some polynomial $P_3(\cdot,\cdot)$, independent of $t_3$, $t_4$. We may then proceed
similarly, and define

\[ r_3(\phi) = E[p_3(\phi(t_3)) \mid \phi(t_2) = \phi], \]  (5.4.84)

\[ p_2(\phi) = Q(t_2,\phi)\, r_3(\phi), \]  (5.4.85)

\[ r_2(\phi) = E[p_2(\phi(t_2)) \mid \phi(t_1) = \phi], \]  (5.4.86)

\[ p_1(\phi) = Q(t_1,\phi)\, r_2(\phi). \]  (5.4.87)

Repeating the steps from (5.4.78) to (5.4.82) a few more times, we
conclude that

\[ |p_1(\phi) - p_1(\phi')| \le P(\|\phi\|, \|\phi'\|)\, \lambda^{t_4-t_1} \]  (5.4.88)

for some polynomial P independent of $t_1$, $t_2$, $t_3$, $t_4$. Consequently,

\[ E[p_1(\phi(t_1))] \le A\lambda^{t_4-t_1} \]  (5.4.89)

for some A>0. On the other hand, note that

\[ E[p_1(\phi(t_1))] = E\big[ Q(t_1,\phi(t_1)) \cdot E[ Q(t_2,\phi(t_2)) \cdot E[ Q(t_3,\phi(t_3)) \cdot E[Q(t_4,\phi(t_4)) \mid \phi(t_3)] \mid \phi(t_2)] \mid \phi(t_1)] \big] = E[Q(t_1,\phi(t_1)) Q(t_2,\phi(t_2)) Q(t_3,\phi(t_3)) Q(t_4,\phi(t_4))], \]

which completes the proof of (5.4.76). Therefore, Lemma 5.4.2 applies to
$R_\ell(k|t)$ and proves (5.4.66).

e) This follows from parts (c), (d), together with (5.4.21), (5.4.22). □


Lemma 5.4.4: Fix some $\omega \in \Omega$, C>0, $\bar y \in H$, $n \in \mathbb{N}$, p>0, $A \ge A_0$ and suppose that


(5.4.90)


$\|y(n) - \bar y\| \le p$,

$\|y(n)\| \le C$,

$\|x^i(n)\| \le C$, $\forall i$


q(n,A,2C)<  - 


F„(n,A,y)  +  e_(n,A,y)  +  7Cq(n,A,2C)<  C 


(5.4.91) 


(5.4.92) 


(5.4.93) 


(5.4.94) 


(5.4.95) 
Then,

\[ \|y(m(n,A)+1) - \bar y - F(n,A,\bar y)\| \le 3p + 16 q(n,A,2C) G(n,A,\bar y) + \varepsilon(n,A,\bar y) + 32 \big[ q(n,A_0,2C) + q(n,A,2C)(d + q(n,A,2C)) \big] D(n), \]  (5.4.96)

where d, $A_0$ are the constants of Assumption 5.4.7. (The exact values of the
numerical constants appearing in (5.4.94)-(5.4.96) are not important.)


Proof of Lemma 5.4.4: From equation (5.4.4) we obtain

\[ y(m(n,A)+1) = y(n) + \sum_{k=n}^{m(n,A)} \frac{1}{k} \sum_{i=1}^{M} \Phi^i(k)\, Q^i(k, x^i(k), \phi(k)), \]  (5.4.97)

which may be rewritten as

\[ y(m(n,A)+1) = \bar y + (y(n) - \bar y) + \sum_{k=n}^{m(n,A)} \frac{1}{k} f(k,\bar y) + \sum_{k=n}^{m(n,A)} \frac{1}{k} \sum_{i=1}^{M} \Phi^i(k) \big[ Q^i(k,\bar y,\phi(k)) - f^i(k,\bar y) \big] + \sum_{k=n}^{m(n,A)} \frac{1}{k} \sum_{i=1}^{M} \Phi^i(k) \big[ Q^i(k,x^i(k),\phi(k)) - Q^i(k,\bar y,\phi(k)) \big]. \]  (5.4.98)

Note that

\[ \|x^i(k) - \bar y\| \le \|x^i(k) - y(k)\| + \|y(k) - \bar y\| \le D(k) + P(k), \]  (5.4.99)

where the last inequality follows from (5.4.43). We also use inequality (5.4.6)
and $\|\Phi^i(k)\| \le 1$ to conclude that the last sum in (5.4.98) is bounded
(in norm) by

\[ \sum_{k=n}^{m(n,A)} \frac{1}{k} K_1(\phi(k), C(k)) \big[ P(k) + D(k) \big]. \]  (5.4.100)


We will now proceed to obtain bounds for P(k), D(k) in terms of the
quantities introduced in the statement of the Lemma. First note that

\[ D(k) = \max_{i,j} \|x^i(k) - x^j(k)\| \le 2 \max_i \|x^i(k)\| \le 2C(k), \quad \forall k, \]  (5.4.101)

\[ \|y(k) - \bar y\| \le \|y(k)\| + \|\bar y\| \le C(k) + C. \]  (5.4.102)


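Using equation (5.4.2) and (5.4.46) (with $n \le k \le m(n,A)+1$),

\[ \|x^i(k)\| = \Big\| \sum_{j=1}^{M} \Phi^{ij}(k|n-1) x^j(n) + \sum_{t=n}^{k-1} \frac{1}{t} \sum_{j=1}^{M} \Phi^{ij}(k|t)\, Q^j(t,x^j(t),\phi(t)) \Big\| \]
\[ \le \Big\| \sum_{j=1}^{M} \Phi^{ij}(k|n-1) x^j(n) \Big\| + \Big\| \sum_{t=n}^{k-1} \frac{1}{t} \sum_{j=1}^{M} \Phi^{ij}(k|t) f^j(t,\bar y) \Big\| + \Big\| \sum_{t=n}^{k-1} \frac{1}{t} \sum_{j=1}^{M} \Phi^{ij}(k|t) \big[ Q^j(t,\bar y,\phi(t)) - f^j(t,\bar y) \big] \Big\| + \Big\| \sum_{t=n}^{k-1} \frac{1}{t} \sum_{j=1}^{M} \Phi^{ij}(k|t) \big[ Q^j(t,\bar y,\phi(t)) - Q^j(t,x^j(t),\phi(t)) \big] \Big\| \]
\[ \le C(n) + F_2(n,A,\bar y) + e_2(n,A,\bar y) + \sum_{t=n}^{k-1} \frac{1}{t} K_1(\phi(t),C(t)) \big[ P(t) + D(t) \big]. \]  (5.4.103)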


We will use (5.4.103) to prove inductively that

\[ C(k) \le 2C, \quad k \in [n, m(n,A)+1]. \]  (5.4.104)

Indeed, suppose that (5.4.104) holds for all k such that $n \le k \le \bar k$, with
$\bar k \le m(n,A)$. We start with inequality (5.4.103) and use (5.4.101), (5.4.102)
and the assumption (5.4.95) of the Lemma to obtain

\[ C(\bar k+1) \le C + F_2(n,A,\bar y) + e_2(n,A,\bar y) + \sum_{t=n}^{\bar k} \frac{1}{t} K_1(\phi(t),2C) [3C + 4C] \le C + F_2(n,A,\bar y) + e_2(n,A,\bar y) + 7C q(n,A,2C) \le 2C, \]  (5.4.105)

which completes the inductive proof of (5.4.104).

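Bounds for P(k): Consider equation (5.4.98) again, but let the upper limit
in the summation be t, instead of m(n,A), where t satisfies $n \le t \le m(n,A)$. Then,

\[ P(t+1) \le p + \Big\| \sum_{k=n}^{t} \frac{1}{k} f(k,\bar y) \Big\| + \Big\| \sum_{k=n}^{t} \frac{1}{k} \sum_{i=1}^{M} \Phi^i(k) \big[ Q^i(k,\bar y,\phi(k)) - f^i(k,\bar y) \big] \Big\| + \sum_{k=n}^{t} \frac{1}{k} K_1(\phi(k),2C) \big[ D(k) + P(k) \big]. \]  (5.4.106)

Let

\[ P^* = \max_{n \le k \le m(n,A)} P(k), \]  (5.4.107)

\[ D^* = \max_{n \le k \le m(n,A)} D(k), \]  (5.4.108)

and use the notation (5.4.39), (5.4.36), (5.4.34) to obtain

\[ P^* \le p + F_1(n,A,\bar y) + e_1(n,A,\bar y) + q(n,A,2C) [P^* + D^*]. \]  (5.4.109)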


Bounds for D(k): Using (5.4.2), (5.4.46),

\[ D(k) \le D(n) + 2F_2(n,A,\bar y) + 2e_2(n,A,\bar y) + 2q(n,A,2C) [D^* + P^*]. \]  (5.4.110)

Suppose now that $m(n,A_0) \le k \le m(n,A)+1$. We use (5.4.23) and (5.4.32) to
strengthen (5.4.110):

\[ D(k) \le d D(n) + 2F_2(n,A,\bar y) + 2e_2(n,A,\bar y) + 2q(n,A,2C) [D^* + P^*]. \]  (5.4.111)

If we now add (5.4.109) and (5.4.110), use (5.4.94) to eliminate q(n,A,2C)
and bring $(D^* + P^*)$ to the left-hand side, we obtain

\[ P^* + D^* \le 4 [D(n) + p + G(n,A,\bar y)]. \]  (5.4.112)

Suppose once more that $m(n,A_0) \le k \le m(n,A)+1$ and use (5.4.112) in (5.4.111):

\[ D(k) \le (d + 8q(n,A,2C)) D(n) + p + 3G(n,A,\bar y). \]  (5.4.113)

Similarly, using (5.4.112) in (5.4.109):

\[ P^* \le 2p + 2G(n,A,\bar y) + 8q(n,A,2C) D(n). \]  (5.4.114)

We are finally in a position to bound (5.4.100):

\[ \sum_{k=n}^{m(n,A)} \frac{1}{k} K_1(\phi(k),C(k)) [P(k) + D(k)] \le \sum_{k=n}^{m(n,A_0)} \frac{1}{k} K_1(\phi(k),2C) [P^* + D^*] + \sum_{k=m(n,A_0)}^{m(n,A)} \frac{1}{k} K_1(\phi(k),2C) [P^* + D(k)] \]
\[ \le 4q(n,A_0,2C) [D(n) + p + G(n,A,\bar y)] + q(n,A,2C) \big[ 3p + 5G(n,A,\bar y) + (d + 16q(n,A,2C)) D(n) \big]. \]  (5.4.115)

Using (5.4.115) to bound the terms in (5.4.98) we obtain

\[ \|y(m(n,A)+1) - \bar y - F(n,A,\bar y)\| \le p + \varepsilon(n,A,\bar y) + \sum_{k=n}^{m(n,A)} \frac{1}{k} K_1(\phi(k),C(k)) [P(k) + D(k)] \]
\[ \le 3p + \varepsilon(n,A,\bar y) + 9q(n,A,2C) G(n,A,\bar y) + 16 \big[ q(n,A_0,2C) + q(n,A,2C)(d + q(n,A,2C)) \big] D(n). \quad \Box \]  (5.4.116)

Lemma 5.4.5: Let $H_0$ be a countable dense subset of H. Given $C_2>0$, there exist
$A_0>0$, A>0, $d \in [0,1)$ and a subset $\Omega'$ of $\Omega$ (of measure 1), such that for any $a \in [0,1)$,
$\bar y \in H_0$ (satisfying $\|\bar y\| \le C_2$) and for all $\omega \in \Omega'$ there exists $N_0$ so that the following
is true:

If $n \ge N_0$, $\|y(n) - \bar y\| \le aA^2$, $\|x^i(n)\| \le C_2$ and

\[ \Big\| \frac{\partial V}{\partial y}(\bar y) \Big\| + D(n) \ge a, \]  (5.4.117)

then 

2-2  a2 

V(y  (m(n,A)+l)  +  (D (m(n,A)  +1)  ]  <  V(y)  +  D  (n)  -  mintljA^A}—  (5.4.118) 

where  A^  is  the  constant  of  Assumption  5.4.5. 

Proof: Let $H_0$, $C_2>0$ be given and suppose, for the time being, that A>0 has been
fixed so that A<1 and so that the assumptions (5.4.94), (5.4.95) of Lemma 5.4.4
are satisfied (with $C = C_2$) for n large enough. Let $A_0 = A^2$, d = A.

From Lemma 5.4.3, we obtain results on certain limits, which, for any
$\bar y \in H$, are valid, except possibly in a nullset. Now, since $H_0$ is countable, there
exists some $\Omega' \subset \Omega$ of measure one, such that equations (5.4.63)-(5.4.67) are valid
for all $\bar y \in H_0$, $\omega \in \Omega'$. Suppose now that a>0, $\bar y \in H_0$ ($\|\bar y\| \le C_2$) and $\omega \in \Omega'$ have been
fixed. Let $A(2C_2)$, A' be the constants of Lemma 5.4.3, let $A_1$ be the constant
of Assumption 5.4.5, let a>0 and define p, $\delta$ by:


p  =  aA  , 


5  =  min{l,A^}  — 


Using  the  short-hand  notation 


(5.4.119) 


(5.4.120) 


110  (7)11. 


let  N_  be  large  enough  so  that,  for  any  n>N  , 
u  —  o 


(5.4.121) 


G(n,A,y)<  A'UA  +  a2A2 


q(n,A,2C  )<  A(2C  )A 


q(n,AQ,2C2)<  A(2C2)AQ  =  A(2C.,)A" 


e(n,A,y)<  aA 


_  'Ay  _  2  2 

F(n,A,y)  •  (y)<  -  A^U  A  +  A  a 


Such an $N_0$ exists by Lemma 5.4.3, for the first four inequalities, and by
Assumption 5.4.5, for the last one. Let $B = \max\{A', A(2C_2), 1\}$.

Using Lemma 5.4.4 and (5.4.122)-(5.4.126),

_  _  2  2  2 

| |y(m(n,A)+l)-y-F(n,A,y) I l<  3aA  +  16BA[BUA+aA  ]  +  aA  + 


+  32 [BA  +BA(A+BA) ]D(n) 


(5.4.122) 


(5.4.123) 


(5.4.124) 


(5.4.125) 


(5.4.126) 


(5.4.127) 


(5.4.128) 


(5.4.129) 


(5.4.130) 


Using  the  inequality  (5.4.117)  to  eliminate  a,  (5.4.127)  simplifies  to 

j  |y (m(n,A)  +1)  -y-F  (n,A,y)  |  |<  M^BJUA2  +  M1(B)D(n)A2 

where  M^B)  is  some  constant  depending  only  on  B. 

We  also  use  (5.4.113): 


2  2 

D (m(n,A) +1) < (A+8BA) D (n)  +  aA  +  2BUA  +  2aA  < 


<  M2(B)A(D(n)+U) 

where  M2(B)  is  a  constant  depending  only  on  B.  Let 

MMy|^"^<y>IMIS(y>11^ 

and  use  (5.4.128)  in  a  second  order  series  expansion  for  V: 


_  _  'Att  _  2 

V(y(m(n,A)+l))<  V(y)  +  F(n,A,y)  •  (y)  +  M^OJUA  (U+Dfri) )  + 

M 

■a  222242  242  22  2 

+  ~  16 [B  U  A  +a  A  +M1(B)U  A  +M1(B)U  D  (n)A  ]< 

<V(7)  -  AXU2A  +  M4(B)A2 (U2+D2(n)) ,  (5.4.131) 

for  some  new  constant  M4(B).  We  also  square.  (5.4.129)  to  obtain 

(D (m(n,A)+l)  ] 2  <  M5<B)A2(U2+D2(n)}  = 

=  D2(n)  -  (l-Mc(B)A2)D2(n)  +Mc(B)A2U2  .  (5.4.132) 

D  D 

Adding  (5.4.131)  and  (5.4.132),  we  have 


V (y (m(n,A)+l)  +  (D(m(n,A)+l) ]2  £  V(y)  +  D2 (n)  - 

-A,U2A  -  (1-M_ (B) A2 ) D2 (n)  +  (M„(B)+Mr(B))A2(U2+D2(n)) 


(5.4.133) 


Clearly,  if  A  has  been  chosen  so  that 


1  -  M.(B)A2>  j 
5  —  4 


(m4(b)+m5(b))A2<  , 


(m4(b)+m5(b))A  <  * 


(5.4.134) 


(5.4.135) 


(5.4.136) 


V (y  (m(n,A)  +1) )  +  [D(m(n,A)+l)]2<  V(y)  +  D2  (n)  -  U2A  -  j  D2  (n)  (5.4.137) 

2  2  a2 

Now  note  that  U  +  d  (n)>  —  which  shows  that 

inequality (5.4.118) is satisfied. Moreover, note that A was selected only as a
function of B, which in turn depended only on $C_2$, as required. □


Lemma 5.4.6: Under the assumptions of the Theorem,

\[ \liminf_{n\to\infty} [V(y(n)) + D^2(n)] = 0, \]  (5.4.138)

almost surely.


Proof of Lemma 5.4.6: Let $C<\infty$ be the constant appearing in the statement of
the Theorem. Let

\[ C_1 = \max_{\|y\| \le C} V(y) + 4C^2 + 1 \]  (5.4.139)

and note that if $\|x^i(n)\| \le C$, $\forall i$, then $V(y(n)) + D^2(n) < C_1$. Then pick a new
constant $C_2$ such that $\inf_{\|y\| \ge C_2} V(y) > C_1$. Such a constant exists, by (5.4.19).

Let $H_0$ be a countable dense subset of H and choose $A_0$, A, d, $\Omega'$ as in
Lemma 5.4.5.

Fix some $\omega \in \Omega'$ and assume, to derive a contradiction, that

\[ \liminf_{n\to\infty} [V(y(n)) + D^2(n)] = b > 0. \]  (5.4.140)

We also define

\[ a' = \tfrac{1}{2} \liminf_{n\to\infty} \Big[ \Big\| \frac{\partial V}{\partial y}(y(n)) \Big\| + D(n) \Big]. \]  (5.4.141)

Using Assumption 5.4.5 and (5.4.140) we conclude that a'>0. Also note that
$b \le C_1$ and let a = min{1, a'}.

By  a  standard  compactness  argument,  we  can  see  that  there  exists  some 
yeHQ  and  a  sequence  n^  of  integers  such  that: 


I  M  li  c2  '  I  (nk}  l  I  —  C2  '  Vi,k 

I  Iy^J-yI  li  aA2,  Vk, 

110  (y(nk))“  0  (y)H-a'  vk' 

I  | v (y (n^) ) -V (y)  |  |<  niinU^Aja2, 

Vlyin^))  +  D2(nk)<  b  +  minCl.A^Aja2, 
1 10  2a  • 


(5.4.142) 


(5.4.143) 


(5.4.144) 


(5.4.145) 


(5.4.146) 


(5.4.147)


If $n_k$ is large enough (larger than the constant $N_0$ given by Lemma 5.4.5)
then (5.4.118) holds and yields:


V(y(m(nk,A)+l))  +  [D  (mtn^A) +1)  ] 2  <  V  (y)  +  D2  (r^)  - 

2 

-  min{l,A^A}  —  <_ 

^Vtyti^))  +  D2(n^)  +  ~  minUrA^A}  a2< 

<  b  -  mintl.A^}  a2  ,  (5.4.148) 

thus  contradicting  (5.4.140) .  □ 


To complete the proof of the Theorem, we may assume that

$$\limsup_{n\to\infty}\,[V(y(n)) + D^2(n)] = a > 0 \tag{5.4.149}$$

and derive a contradiction. The argument involved is identical to the
corresponding argument of Ljung [1977a] and we only provide an outline.

Given any $a'\in(0,a)$ we may choose an integer sequence $\{n_k\}$, and a vector
$(\bar{y},\bar{x}^1,\dots,\bar{x}^M)$ with $\bar{y}\in H_0$, such that $y(n_k)$ is close to $\bar{y}$, $x^i(n_k)$ is close to
$\bar{x}^i$, $V(y(n_k)) + D^2(n_k)$ converges to $a'$ and such that, after time $n_k$, the quantity
$V(y(n)) + D(n)$ reaches the level $a-\epsilon$ before falling below $a'-\epsilon$, where $\epsilon>0$ has been
chosen suitably small. There are two cases to consider:

(i) If the level $a-\epsilon$ is reached before time $m(n_k,\Delta)+1$ (where $\Delta\le\Delta_0$
is chosen as in Lemma 5.4.5), then provided that $a'$ is small enough, (5.4.96)
is contradicted.

(ii) If the level $a-\epsilon$ is reached after time $m(n_k,\Delta)+1$, then Lemma 5.4.5
implies that $V(y(n)) + D(n)$ should first fall below $a'-\epsilon$, before reaching the
level $a-\epsilon$, thus contradicting our assumption.

Therefore,

$$\lim_{n\to\infty}\,[V(y(n)) + D(n)] = 0, \quad \text{almost surely}, \tag{5.4.150}$$

which implies $\lim_{n\to\infty} y(n) = x^*$, $\lim_{n\to\infty} D(n) = 0$, or, equivalently, $\lim_{n\to\infty} x^i(n) = x^*$, $\forall i$,
almost surely. ■

Technical  Remarks  on  the  Proof  of  Theorem  5.4.1 

The preceding proof has certain basic similarities with the proof of
Theorem 1 of Ljung [1977a]. The main difference is that instead of keeping track
of the evolution of one state vector only, we also need to keep track of the
magnitude of the disagreement between processors, which greatly complicates the
associated inequalities. So, unlike Theorem 5.4.2 that follows, the preceding
proof differs from Ljung's in non-trivial ways.

Another difference is that Ljung's Theorem 1 required a returning condition
on the process $\{\phi(t)\}$, in addition to the returning condition of the state of
computation. It may be seen in Ljung's proof that this condition was necessary
only for algorithms in which the dynamics of the process $\{\phi(t)\}$ were influenced by
the state $x(t)$ of the algorithm. Such a possibility, however, was not allowed
by us and the returning condition on $\{\phi(t)\}$ was not needed.

The  Associated  ODE 

The results of Ljung [1977a] are not so useful for rigorously demonstrating
convergence of stochastic algorithms; rather, by associating an ordinary differential
equation (ODE) with such an algorithm, they provide a simple heuristic
method of analysis of recursive stochastic algorithms. In fact the ODE is not
essential in Ljung's result. The mathematically important entity is the Lyapunov
function associated with this ODE. (This is pointed out in Appendix V of
[Ljung 1977a].)

For this reason, our result has been formulated exclusively in terms of
the function $V$, without reference to an ODE. Our result translates easily to an
ODE-type result if we assume that the algorithm is (asymptotically) time invariant.
More specifically, assume that

$$\frac{1}{\Delta}\lim_{n\to\infty}\sum_{k=n}^{m(n,\Delta)}\frac{1}{k}\,f(k,x)$$

exists, for any $\Delta>0$, and is independent of $\Delta$. Let us denote this limit by $f(x)$, and
note that, by Assumption 5.4.5, $V$ is a Lyapunov function for the ODE $\dot{x}=f(x)$.
Conversely, if the equation $\dot{x}=f(x)$ is globally asymptotically stable, we may invoke
converse stability theorems to construct a function $V$ satisfying Assumptions 5.4.5,
5.4.6. Consequently, the algorithm will converge to the point of convergence $x^*$ of
the ODE $\dot{x}=f(x)$.
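As a rough illustration of the ODE heuristic (not part of the original analysis), the following sketch runs a scalar recursion with step size $1/n$ whose mean update is $f(x) = -(x - x^*)$, so that the associated ODE is $\dot{x} = f(x)$ and $V(x) = (x-x^*)^2$ is a Lyapunov function. The Gaussian noise model, the value of `x_star` and the function name `run_sa` are assumptions made only for this example.

```python
import random


def run_sa(x0, x_star=2.0, steps=20000, seed=0):
    """Scalar stochastic approximation x(n+1) = x(n) + (1/n) * (f(x(n)) + noise).

    Here f(x) = -(x - x_star), so the associated ODE dx/dt = f(x) is
    globally asymptotically stable with equilibrium x_star.
    """
    rng = random.Random(seed)
    x = x0
    for n in range(1, steps + 1):
        noise = rng.gauss(0.0, 1.0)          # zero-mean disturbance (assumed)
        x += (1.0 / n) * (-(x - x_star) + noise)
    return x
```

Calling `run_sa(10.0)` should return a value close to `x_star`; the iterate is the equilibrium plus a running average of the noise, which is the behavior the ODE picture predicts.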

Extensions 

Theorem 5.4.1 is a decentralized analog of a special case of Theorem 1 of
Ljung [1977a]. Modifications or generalizations of this result are possible,
along the lines of Ljung. We discuss some of them below:

1) We may assume that the function $V$ is such that inequality (5.4.20) is
only satisfied on a bounded subset of $H$. (In the ODE language, the equation
$\dot{x}=f(x)$ has a bounded domain of attraction.) Then, there exists a subset of
$H$ such that, if the algorithm returns an infinite number of times to this subset
and if $\Delta_0$, $d$ are small enough, then the conclusion of Theorem 5.4.1 remains valid.
(The proof is effectively identical with that of Theorem 5.4.1.)


2) Suppose that the algorithm is asymptotically time-invariant, so that
we may associate an ODE $\dot{x}=f(x)$. Then, under certain additional smoothness
assumptions (see Theorem 2 of Ljung [1977a]) we may prove that the algorithm may
converge only to stable equilibrium points of $\dot{x}=f(x)$.

The final and most important extension is to allow the dynamics of the
process $\{\phi(t)\}$ to be influenced by the states of the algorithm. This extension
is necessary if we want to apply the results to algorithms for adaptive control
or to certain system identification applications [Ljung, 1977a]. For technical
reasons,  we  have  not  proved  convergence  under  the  assumption  that  the  algorithm 
returns  infinitely  many  times  to  a  bounded  region.  We  use  the  stronger  assumption 
that  the  processors  come  arbitrarily  close  to  agreeing,  infinitely  many  times. 

We  also  assume  that  the  process  of  communicating  and  combining  is  faster  than  the 
algorithm.  Then,  whenever  the  processors  are  very  close  to  agreeing,  the  algorithm 
behaves  (up  to  first  order)  as  a  centralized  algorithm  and  the  centralized  results 
may  be  recovered  by  replicating  the  proof  of  Ljung.  The  result  that  follows, 
although  important  from  the  point  of  view  of  applications,  is  not  significantly 
different  -  mathematically  -  from  Theorem  1  of  Ljung.  For  this  reason,  we  only 
give  an  outline  of  the  proof.  Moreover,  we  formulate  the  Assumptions  and  the 
result  in  the  language  of  Ljung,  so  that  a  direct  comparison  may  be  made. 

Let $z(n)$ denote $(x^1(n),\dots,x^M(n))$, which is therefore an element of $H^M$.

Also, for any $x\in H$, let $z(x)$ denote the $M$-tuple $(x,x,\dots,x)$.

Assumption 5.4.8

a) The steps $s^i(n)$ are given by $Q^i(n,x^i(n),\phi(n))$, where the process
$\{\phi(n)\}$ is generated by $\phi(0)=e(0)$ and

$$\phi(n+1) = A(z(n))\phi(n) + B(z(n))e(n). \tag{5.4.151}$$

Here $\{e(n)\}$ is a sequence of independent random vectors, defined on an
underlying probability space $(\Omega,F,P)$ and such that

$$\sup_n E[\|\phi(n)\|^p] < \infty, \quad \forall p>1. \tag{5.4.152}$$

Let $D_S = \{z\in H^M : A(z) \text{ is asymptotically stable}\}$. Let $D_R\subset H$ be open and connected
and let $D_R^M$ denote the Cartesian product of $D_R$ with itself, $M$ times. We assume
that $A(\cdot)$ and $B(\cdot)$ are Lipschitz continuous on $D_R^M$ and that $D_R^M\subset D_S$.

Let $\lambda(z)<1$ be such that

$$\|A(z)^k\| \le C\lambda(z)^k, \quad \forall k\in\mathbb{N}. \tag{5.4.153}$$

For any $z\in D_R^M$, $\lambda<1$, $c>0$, we define the stochastic processes $\phi(t,z)$, $u(t,\lambda,c)$
by

$$\phi(t,z) = A(z)\phi(t-1,z) + B(z)e(t); \quad \phi(0,z)=0, \tag{5.4.154}$$

$$u(t,\lambda,c) = \lambda u(t-1,\lambda,c) + c|e(t)|; \quad u(0,\lambda,c)=0. \tag{5.4.155}$$

(So, $\phi(t,z)$ is the process to be obtained from (5.4.151) if $z(n)$ was "frozen"
to a value $z$; also, if $\lambda=\lambda(z)$ and $\|B(z)\|\le c$, then $u(t,\lambda,c)$ is a bound for
$\phi(t,z)$.)

b) Let $B(x,\rho)$ denote a (closed) ball of radius $\rho$ around some point $x$. We
assume that, for any function $\rho: H\to(0,\infty)$,

$$\|Q^i(t,x',\phi') - Q^i(t,x'',\phi'')\| \le K_1(x,\phi,\rho(x),u)\,\{\|x'-x''\| + \|\phi'-\phi''\|\} \tag{5.4.156}$$

for any $u>0$, $x'$, $x''$, $\phi'$, $\phi''$ satisfying $x'\in B(x,\rho(x))\cap D_R$,
$x''\in B(x,\rho(x))\cap D_R$, $\phi'\in B(\phi,u)$, $\phi''\in B(\phi,u)$.

Moreover, the function $K_1$ satisfies

$$|K_1(x,\phi',\rho,u') - K_1(x,\phi'',\rho,u'')| \le K_2(x,\phi,\rho,u,w)\,\{\|\phi'-\phi''\| + |u'-u''|\} \tag{5.4.157}$$

whenever $\phi'\in B(\phi,w)$, $\phi''\in B(\phi,w)$, $u'\in B(u,w)$, $u''\in B(u,w)$.

c) For any $x\in D_R$, the random variables $Q^i(t,x,\phi(t,z(x)))$,
$K_1(x,\phi(t,z(x)),\rho(x),u(t,\lambda,c))$ and $K_2(x,\phi(t,z(x)),\rho(x),u(t,\lambda,c),u(t,\lambda,c))$ have
bounded $p$-moments for all $p>1$, and all $\lambda<1$, $c<\infty$.

d) The step-size $\gamma(n)$ is equal to $1/n$. Also, communication delays are zero
(or, more generally, bounded) and for any $\Delta>0$,


lira  max  |  | (n | R> (Jc)  |  |  =0  (5.4.158) 

]c-k*>  n>m(k,A) 

(That  is,  the  process  of  communicating  and  combining  is  faster  than  the 
natural  time  scale  of  the  algorithm.  This  is  the  case,  in  particular,  if 
Assumptions  5.2.1,  5.2.3,  hold.) 

e) $\lim_{t\to\infty} E[\Phi^i(t)\,Q^i(t,x,\phi(t,z(x)))]$ exists, for any $x\in D_R$, and is
denoted by $f^i(x)$. We let $f(x) = \sum_{i=1}^{M} f^i(x)$.
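The role of the auxiliary process $u(t,\lambda,c)$ in part (a) can be seen in a scalar sketch. Assuming scalar coefficients `a` and `b` with $|a|\le\lambda<1$ and $|b|\le c$ (so that $C=1$ in (5.4.153)), the recursion (5.4.155) dominates the frozen process (5.4.154) path by path. The function name, the Gaussian driving noise and the numerical values are illustrative assumptions only.

```python
import random


def frozen_and_bound(a, b, lam, c, steps=200, seed=0):
    """Simulate the scalar versions of (5.4.154) and (5.4.155) on one noise path.

    Returns a list of pairs (phi(t), u(t)). If |a| <= lam and |b| <= c,
    then |phi(t)| <= u(t) for every t, by induction on t.
    """
    rng = random.Random(seed)
    phi, u = 0.0, 0.0                      # phi(0, z) = 0, u(0, lam, c) = 0
    history = []
    for _ in range(steps):
        e = rng.gauss(0.0, 1.0)            # driving noise e(t) (assumed Gaussian)
        phi = a * phi + b * e              # frozen process, scalar case
        u = lam * u + c * abs(e)           # bounding process
        history.append((phi, u))
    return history
```

For instance, `frozen_and_bound(0.5, 1.0, 0.6, 1.2)` produces a path on which the absolute value of the first component never exceeds the second.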

Theorem 5.4.2: Let Assumption 5.4.8 hold. Let $\bar{D}$ be a compact subset of $D_R$
such that the trajectories of the ODE

$$\dot{x} = f(x) \tag{5.4.159}$$

that start in $\bar{D}$ remain in a closed subset $D'$ of $D_R$. Assume that (5.4.159)
has an invariant set $D_c$ with domain of attraction $D_A\supset\bar{D}$. Also assume that
there exists a constant $C_\phi$ such that, for almost all $\omega\in\Omega$, there exist sequences
$\{n_k\}$, $\{\epsilon_k\}$ ($n_k\in\mathbb{N}$, $n_{k+1}>n_k$, $\epsilon_k>0$, $\lim_{k\to\infty}\epsilon_k=0$) such that $y(n_k)\in\bar{D}$,
$\|\phi(n_k)\|\le C_\phi$, $D(n_k)\le\epsilon_k$. (Recall that $D(n_k) = \max_{i,j}\|x^i(n_k) - x^j(n_k)\|$.)

Then $x(t)\to D_c$, with probability one, as $t\to\infty$.


Proof (Outline, following Ljung):

Step 1: Fix some $\bar{y}\in D_S$. Let $\rho=\rho(\bar{y})$ be small enough and suppose that for some $n$
we have: $y(n)\in B(\bar{y},\rho(\bar{y}))$, $D(n)\le\rho(\bar{y})$, $|\phi(n)|\le C_\phi$. Fix some small enough $\Delta>0$ and
assume that $n$ is large enough so that

$$\sum_{k=n}^{m(n,\Delta)}\frac{1}{k}\sum_{i=1}^{M}\Phi^i(k)\,Q^i(k,\bar{y},\phi(k,z(\bar{y}))) - \Delta f(\bar{y})$$

can be made as small as desired. Then, provided that $\rho$, $n$, $\Delta$ have been appropriately
chosen, we may conclude that $x^i(k)\in B(\bar{y},2\rho)$, $\forall i$, $\forall k\in[n,m(n,\Delta)]$. It then follows that
$\phi(k)\approx\phi(k,z(\bar{y}))$ and, consequently, $y(m(n,\Delta))\approx y(n) + \Delta f(\bar{y})$. Also, with $n$ large
enough, $D(m(n,\Delta))$ becomes arbitrarily small, due to (5.4.158).

Step 2: Let $V$ be a Lyapunov function for the ODE (5.4.159). If the algorithm
returns infinitely often to the vicinity of some $\bar{y}\in\bar{D}$ (with $V(\bar{y})\neq 0$) and if at those
times $D(n)$ is small and $|\phi(n)|\le C_\phi$, we use the results of Step 1 to show that the
algorithm also returns to the vicinity of some $y'\in\bar{D}$, with $V(y')<V(\bar{y})$, infinitely
often and that, at those times, $D(n)$ is small and $|\phi(n)|\le C_\phi$. Proceeding in this
manner, we conclude that $\liminf_{n\to\infty}\,[V(y(n)) + D(n)] = 0$.

Step 3: If $\lim_{n\to\infty} V(y(n))\neq 0$, it means that $V(y(n))$ must "upcross" an interval
$[a,a']$ (with $0<a<a'$) infinitely many times. Choosing $a$ and $a'$ small enough, a
version of Ljung's argument leads to a contradiction. ■

Remarks:

1. Note that Theorem 5.4.2 assumes that the algorithm returns infinitely
often to an appropriate region (as in Theorem 5.4.1) but also that the disagreement
at those returning times is arbitrarily small. This condition on the
disagreement may be enforced if once in a while each processor communicates to
every other processor and they all combine using the same coefficients; in other
words, disagreement is explicitly eliminated, once in a while. More natural
conditions are also possible.

2. The ODE approach cannot be used to prove global convergence of an
algorithm. Nevertheless, certain algorithms with correlated noise are known to
converge, most notably the ELS algorithm for system identification [Solo, 1979].
However, global convergence of decentralized versions of such algorithms does not
follow automatically and has to be verified by other means. Moreover, general
convergence results do not seem possible; rather, small classes of algorithms must
be studied separately.

3. In Theorem 5.4.2 we have assumed that the decentralized algorithm
is asymptotically time invariant, so that an ODE may be attached to it.
This is not necessary and the result may be stated directly in terms of
a Lyapunov function $V$. (See Assumptions 5.4.5, 5.4.6 and their discussion,
as well as Appendix V of Ljung [1977a].)

4.  Finally,  the  proof  when  delays  are  bounded  (instead  of  zero) 
is  the  same,  provided  that  the  appropriate  state  augmentation  has  been 
made,  and  the  returning  condition  is  appropriately  modified. 
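The mechanism mentioned in Remark 1 (every processor receives every other estimate and all of them combine with the same coefficients) can be sketched as follows. The disagreement measure is $D = \max_{i,j}\|x^i - x^j\|$ as recalled in Theorem 5.4.2; the function names and the particular weights are assumptions made for illustration.

```python
import math


def disagreement(xs):
    """D = max over pairs (i, j) of the Euclidean distance ||x^i - x^j||."""
    return max(math.dist(a, b) for a in xs for b in xs)


def combine_all(xs, weights):
    """Every processor forms the same convex combination of all estimates.

    `weights` are nonnegative and sum to one; because all processors use
    identical coefficients, they end up with identical vectors.
    """
    dim = len(xs[0])
    common = [sum(w * x[d] for w, x in zip(weights, xs)) for d in range(dim)]
    return [list(common) for _ in xs]
```

After one such round the disagreement is exactly zero, which is the sense in which disagreement is "explicitly eliminated, once in a while".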


5.5  APPLICATIONS  IN  SYSTEM  IDENTIFICATION 

Many recursive stochastic algorithms for on-line system identification
are pseudo-gradient algorithms or, more generally, recursive stochastic
algorithms of the type considered by Ljung [1977a]. For a review of
such algorithms and their convergence properties, the reader may refer
to [Ljung, 1981; Goodwin and Payne, 1977; Astrom and Eykhoff, 1971].
Consequently, it should be expected that our results of Sections 5.3 and
5.4 may be used to study the convergence of decentralized identification
schemes. An interesting related issue is the problem of how to decompose
(decentralize) an identification algorithm, so that the resulting scheme
is a reasonable and attractive alternative to the corresponding centralized
algorithm. In this section, we present and discuss a few possibilities.


Let $z^1(t)$, $z^2(t)$ be autoregressive moving average processes described by

$$z^i(t) + a_1 z^i(t-1) + \dots + a_n z^i(t-n) = b_0 u^i(t) + \dots + b_m u^i(t-m) + w^i(t), \quad i=1,2, \tag{5.5.1}$$

where $w^1(t)$, $w^2(t)$ are zero-mean white noises and $u^i(t)$ is an input
process known by processor $P^i$, $i=1,2$. Processor $P^i$ also observes at
time $t$ the output $z^i(t)$ and tries to estimate the parameters $a_j$ and $b_j$
of the process. Note that we are assuming that the parameters $a_j$
and $b_j$ are the same for both processes, that is, they do not depend on $i$.
We may let $q$ be the backward shift operator and rewrite (5.5.1) in
transform notation:


$$A(q)z^i(t) = B(q)u^i(t) + w^i(t), \tag{5.5.2}$$

where

$$A(q) = 1 + a_1 q + \dots + a_n q^n, \tag{5.5.3}$$

$$B(q) = b_0 + b_1 q + \dots + b_m q^m. \tag{5.5.4}$$


The two processors are to cooperate, by exchanging messages, in
identifying the unknown parameters (see Figure 5.5.1). An interesting
special case of the above configuration is depicted in Figure 5.5.2, in
which $A(q)=1$ and the input processes $u^1(t)$, $u^2(t)$ coincide. This is
the case of two processors obtaining noisy observations of the output
of the same (moving average) process.
The first possibility in the above setting is to let each processor
use any of the standard ARMA identification algorithms, in isolation,
without exchanging any messages at all. Then (provided that certain
identifiability conditions are satisfied [Astrom and Eykhoff, 1971]),
each processor will, in the limit, recover the true values of the
parameters.

It is clear, however, that better estimates may be constructed (i.e.
convergence may be faster) if the two processors cooperate by exchanging
some information. Moreover, it is possible that neither of the inputs
$u^1$, $u^2$ is rich enough to allow either processor to identify all parameters
alone, but that the parameters may be accurately identified on the
basis of both sets of measurements.

A trivial alternative is to let processor $P^1$ transmit all its
measurements to processor $P^2$, as they are being observed, and then have
processor $P^2$ compute a "centralized" estimate. This would require, however,
an excessive amount of communications.

Another alternative would be to proceed along the lines of Chapter
4, whereby at each instant of time each processor computes an estimate
which is optimal given its information, including the messages it has
received. This would be, however, computationally hard, in general:
even if all random variables are Gaussian, processor $P^1$ would be faced
with a nonlinear estimation problem unless it could directly measure
$u^2$. (This is because the estimate of processor $P^2$ at any time is a
nonlinear function of $u^2$.) If the inputs were commonly observed, then
each processor would face a linear problem at each time (this is the LQG
setting of Section 4.4), but still not an easy one, except for the very
special case in which communication delays are zero and the noises $w^1$,
$w^2$ are independent. In the latter case, the results of Willsky et al.
[1982] could provide us with an optimal updating rule.

So, there is a clear trade-off between optimality and the complexity
of  the  scheme.  An  approach  that  lies  in  the  middle,  and  which  we  will 
consider,  is  to  let  the  processors  combine  their  own  estimates  with  the 
messages  received  (estimates  of  the  other  processor)  by  simply  forming 
a  convex  combination,  according  to  the  model  proposed  in  Section  5.2. 

We now continue with a more detailed discussion. Let

$$x^* = (a_1, a_2, \dots, a_n;\ b_0, b_1, \dots, b_m)^T \tag{5.5.5}$$

be the vector of unknown parameters and let

$$\psi^i(t) = (-z^i(t-1), \dots, -z^i(t-n);\ u^i(t), \dots, u^i(t-m))^T, \quad i=1,2. \tag{5.5.6}$$

Then, (5.5.1) becomes

$$z^i(t) = \psi^i(t)^T x^* + w^i(t), \quad i=1,2, \tag{5.5.7}$$

and $\psi^i(t)$ is a vector which is known by processor $i$ at time $t$.
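The passage from the difference equation (5.5.1) to the regression form (5.5.7) can be checked numerically. The sketch below simulates one process and builds the regressor; the helper names `regressor` and `simulate`, the Gaussian input and noise, and the coefficient values in the usage note are assumptions chosen only for illustration.

```python
import random


def regressor(z, u, t, n, m):
    """psi(t) = (-z(t-1), ..., -z(t-n); u(t), ..., u(t-m)), as in (5.5.6)."""
    past_outputs = [-z[t - j] for j in range(1, n + 1)]
    inputs = [u[t - j] for j in range(0, m + 1)]
    return past_outputs + inputs


def simulate(a, b, steps, seed=0, noise_std=0.1):
    """Generate z(t) from z(t) = psi(t)^T x* + w(t) with x* = (a; b)."""
    rng = random.Random(seed)
    n, m = len(a), len(b) - 1
    start = max(n, m)                       # first time with a full regressor
    total = start + steps
    z = [0.0] * total
    u = [rng.gauss(0.0, 1.0) for _ in range(total)]
    w = [rng.gauss(0.0, noise_std) for _ in range(total)]
    x_star = list(a) + list(b)
    for t in range(start, total):
        psi = regressor(z, u, t, n, m)
        z[t] = sum(p * x for p, x in zip(psi, x_star)) + w[t]
    return z, u, w, x_star
```

With `a = [0.5]` and `b = [1.0, 0.3]`, the simulated output satisfies `z[t] + 0.5*z[t-1] == u[t] + 0.3*u[t-1] + w[t]` up to rounding, which is (5.5.1) for $n=m=1$.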

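The simplest algorithm to be considered is the LMS algorithm [Ljung,
1981]: in the absence of any communications, processor $i$ forms the
estimate $\hat{x}^i(t)$ recursively, as follows:

$$\hat{x}^i(t+1) = \hat{x}^i(t) + \frac{1}{t}\,\psi^i(t)\,\bigl(z^i(t) - \psi^i(t)^T\hat{x}^i(t)\bigr); \tag{5.5.'}$$

in the notation of Section 5.2, $\gamma(t) = 1/t$ and

$$s^i(t) = \psi^i(t)\bigl(z^i(t) - \psi^i(t)^T\hat{x}^i(t)\bigr) = \psi^i(t)\psi^i(t)^T\bigl(x^* - \hat{x}^i(t)\bigr) + \psi^i(t)w^i(t). \tag{5.5.8}$$

We also consider the NLMS algorithm:

$$\hat{x}^i(t+1) = \hat{x}^i(t) + \frac{1}{t}\,\frac{\psi^i(t)}{\|\psi^i(t)\|^2 + \epsilon}\,\bigl(z^i(t) - \psi^i(t)^T\hat{x}^i(t)\bigr) \tag{5.5.9}$$

for which

$$s^i(t) = \frac{\psi^i(t)\psi^i(t)^T}{\|\psi^i(t)\|^2 + \epsilon}\,\bigl(x^* - \hat{x}^i(t)\bigr) + \frac{\psi^i(t)}{\|\psi^i(t)\|^2 + \epsilon}\,w^i(t), \tag{5.5.10}$$

where $\epsilon>0$ is some constant introduced to avoid division by zero.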
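For concreteness, the two updates can be written as single-step functions. This is only a sketch for one processor without communications: the step size is $1/t$ as in the text, while the synthetic regressors (independent, uniformly distributed and hence bounded), the noise level and the helper `identify` are assumptions introduced for the example.

```python
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def lms_step(x, psi, z, t):
    """LMS update: x + (1/t) * psi * (z - psi^T x)."""
    err = z - dot(psi, x)
    return [xi + (1.0 / t) * p * err for xi, p in zip(x, psi)]


def nlms_step(x, psi, z, t, eps=1e-2):
    """NLMS update: as LMS, with psi divided by (||psi||^2 + eps)."""
    err = z - dot(psi, x)
    scale = 1.0 / (dot(psi, psi) + eps)
    return [xi + (1.0 / t) * scale * p * err for xi, p in zip(x, psi)]


def identify(step, x_star, steps=20000, seed=1):
    """Run one of the two updates on synthetic data z = psi^T x* + w."""
    rng = random.Random(seed)
    x = [0.0] * len(x_star)
    for t in range(1, steps + 1):
        psi = [rng.uniform(-2.0, 2.0) for _ in x_star]   # bounded regressor (assumed)
        z = dot(psi, x_star) + rng.gauss(0.0, 0.1)        # noisy observation
        x = step(x, psi, z, t)
    return x
```

Calling `identify(lms_step, [0.5, -0.3])` or `identify(nlms_step, [0.5, -0.3])` should return estimates near the true parameter vector, since these synthetic regressors are rich enough for a single processor.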

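Let $F_t$ be the $\sigma$-field generated by $\{\psi^i(0),\dots,\psi^i(t);\ i=1,2\}$, and
introduce the cost function

$$J(x) = \tfrac{1}{2}(x-x^*)^T(x-x^*). \tag{5.5.11}$$

We assume that $E[w^i(t)\mid F_t] = 0$, $i=1,2$; so $w^i(t)$ is uncorrelated with past
values of $w^i$, as well as with past values of $u^i$. We then have, for the LMS
algorithm:

$$E\Bigl[\Bigl\langle \frac{\partial J}{\partial x}(\hat{x}^i(t)),\, s^i(t)\Bigr\rangle \Bigm| F_t\Bigr] = (\hat{x}^i(t)-x^*)^T E\bigl[\psi^i(t)\psi^i(t)^T(x^*-\hat{x}^i(t)) + \psi^i(t)w^i(t) \bigm| F_t\bigr] = -(\hat{x}^i(t)-x^*)^T\psi^i(t)\psi^i(t)^T(\hat{x}^i(t)-x^*) \le 0. \tag{5.5.12}$$

Similarly, for the NLMS algorithm

$$E\Bigl[\Bigl\langle \frac{\partial J}{\partial x}(\hat{x}^i(t)),\, s^i(t)\Bigr\rangle \Bigm| F_t\Bigr] = -(\hat{x}^i(t)-x^*)^T\,\frac{\psi^i(t)\psi^i(t)^T}{\|\psi^i(t)\|^2+\epsilon}\,(\hat{x}^i(t)-x^*), \tag{5.5.13}$$

so that, for either algorithm, the pseudogradient Assumption 5.3.3 is
satisfied.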

Moreover, in either of the following two cases:

(i) For the LMS algorithm and with $\psi^i(t)$ a process whose sample
paths are bounded by some $A>0$ (so that sample paths of $u^i(t)$ and $w^i(t)$
are also bounded)*,

(ii) For the NLMS algorithm and with $w^i(t)$ a process with finite second moment,

it is easy to see that

$$E[\|s^i(t)\|^2] \le A\,E[J(\hat{x}^i(t))] + B, \quad \forall i,t \tag{5.5.14}$$

for some constants $A$, $B$, so that Assumption 5.3.5 is also satisfied.

Concerning the process of communications, we assume Assumptions 5.2.1, 5.2.3
and 5.2.4. We are also assuming that all components of $\hat{x}^i$ are equally
weighted when combining the estimates of two processors; therefore, the
combining matrices of Section 5.2, including $\Phi^i(t)$, are actually scalars.
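A toy version of the resulting scheme is sketched below: each processor runs LMS on its own data and, every few steps, replaces its estimate by a scalar convex combination of its own and the other processor's estimate. The setup is deliberately artificial and is an assumption of this example, not of the text: processor 1 only excites the first parameter and processor 2 only the second, so neither can identify $x^*$ alone. The names `run`, `period` and `beta` are likewise hypothetical.

```python
import random


def lms_step(x, psi, z, t):
    err = z - sum(p * xi for p, xi in zip(psi, x))
    return [xi + (1.0 / t) * p * err for xi, p in zip(x, psi)]


def combine(own, other, beta):
    """Scalar convex combination: (1 - beta) * own + beta * other."""
    return [(1.0 - beta) * a + beta * b for a, b in zip(own, other)]


def run(x_star, steps=5000, period=5, beta=0.5, communicate=True, seed=3):
    rng = random.Random(seed)
    x1, x2 = [0.0, 0.0], [0.0, 0.0]
    for t in range(1, steps + 1):
        u1 = rng.choice((-1.0, 1.0)) * rng.uniform(1.0, 2.0)
        u2 = rng.choice((-1.0, 1.0)) * rng.uniform(1.0, 2.0)
        psi1, psi2 = [u1, 0.0], [0.0, u2]      # each input excites one parameter only
        z1 = u1 * x_star[0] + rng.gauss(0.0, 0.1)
        z2 = u2 * x_star[1] + rng.gauss(0.0, 0.1)
        x1 = lms_step(x1, psi1, z1, t)
        x2 = lms_step(x2, psi2, z2, t)
        if communicate and t % period == 0:
            # both right-hand sides use the estimates held before the exchange
            x1, x2 = combine(x1, x2, beta), combine(x2, x1, beta)
    return x1, x2
```

With `communicate=True` both estimates approach `x_star`; with `communicate=False` the component that a processor never excites stays at its initial value.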

It  follows  from  Theorem  5.3.2  that 


CO  2 

V  1  V  *i  f.  %  , 


l  ~  l  V  (t)(x  (t)  -  x*)  Q1  (t)  (x  (t)  -  x*) 
t*l  i-1 

almost  surely,  where  Qx(t)  is  defined  as  follows: 


(5.5.14) 


(i) (LMS) $Q^i(t) = \psi^i(t)\psi^i(t)^T$ (5.5.15)

(ii) (NLMS) $Q^i(t) = \dfrac{\psi^i(t)\psi^i(t)^T}{\|\psi^i(t)\|^2+\epsilon}$ (5.5.16)

* Naturally, we assume that the system (5.5.1) is stable.


We also know from Theorem 5.3.2 that $\hat{x}^1(t) - \hat{x}^2(t)$ converges to zero and
that $(\hat{x}^i(t) - x^*)^T(\hat{x}^i(t) - x^*)$ converges to some value, almost surely,
for each $i$. However, the preceding do not imply yet that $\hat{x}^i(t)$ converges
to $x^*$. For this we need some identifiability condition; that is, the
inputs $u^i(t)$ must be sufficiently rich. As an example, we present a
convergence result, very similar to results for centralized identification,
in which we also summarize the preceding discussion.

Theorem 5.5.1: Let the processes $z^i(t)$ be obtained from (5.5.1) and
assume that

$$E[w^i(t)\mid F_t] = 0, \quad \sup_t E[|w^i(t)|^2] < \infty. \tag{5.5.17}$$

Let Assumptions 5.2.1, 5.2.3, 5.2.4 be satisfied. Consider either the LMS
algorithm (with $\|\psi^i(t)\|$ bounded by some constant) or the NLMS algorithm
and assume that there exist constants $a\in\mathbb{N}$, $B>0$ such that

$$E\Bigl[\sum_{t=k}^{k+a}\sum_{i=1}^{2} Q^i(t) \Bigm| F_k\Bigr] \ge B\,I, \quad \forall k\in\mathbb{N}, \text{ a.s.}, \tag{5.5.18}$$

where $I$ is the identity matrix. Then,

$$\lim_{t\to\infty}\hat{x}^i(t) = x^*, \quad i=1,2, \tag{5.5.19}$$

almost surely.


Proof: From the proof of Theorem 5.3.2, we obtain

$$\lim_{k\to\infty}\max_{k\le t\le k+a} E[\|\hat{x}^i(k) - \hat{x}^j(t)\|^2] = 0, \quad i,j=1,2. \tag{5.5.20}$$

Moreover, Theorem 5.3.2c yields

$$\liminf_{t\to\infty} E[(\hat{x}^i(t)-x^*)^T Q^i(t)(\hat{x}^i(t)-x^*)] = 0, \quad i=1,2. \tag{5.5.21}$$

Combining (5.5.20) with (5.5.21) and using the fact that $Q^i(t)$ is bounded
we obtain

lim  inf  max  E I (x1 (k)-x*)T$ j (t)Qj (t) (x1 (k)-x*) ]  =0  (5.5.21) 

k+°°  k<t<k+a 


and,  consequently, 

2  k+a 

lim  inf  £  £  E[ (&1 (k)-x*)T  Q3 (t) (x1 (k)-x*) ]  *  0  (5.5.22) 

k-**>  j=l  t=k 

and, using Fatou's Lemma,


lim  inf  (*x(k)-x*r  E 
k^o° 


'  2  k+a  .  "1 

l  l  Q3(t)|F  (x1 

_j*l  t*k 


(k)-x*)=0  , 


(5.5.23) 


which, in view of (5.5.18) implies that

$$\liminf_{k\to\infty}\|\hat{x}^i(k) - x^*\|^2 = 0. \tag{5.5.24}$$

On the other hand, by Theorem 5.2.3a, $\|\hat{x}^i(k)-x^*\|^2$ converges, which
shows that $\hat{x}^i(k)$ converges to $x^*$, almost surely. ■


We now continue with the same example and consider the RLS (recursive
least squares) algorithm. In this algorithm each processor evaluates
recursively (in the absence of communications) an auxiliary quantity (a
matrix) $R^i(t)$ [Ljung, 1981]:

$$R^i(t+1) = R^i(t) + \frac{1}{t}\bigl[\psi^i(t)\psi^i(t)^T - R^i(t)\bigr] \tag{5.5.25}$$

and updates its estimate according to

$$\hat{x}^i(t+1) = \hat{x}^i(t) + \frac{1}{t}\bigl[R^i(t+1)\bigr]^{-1}\psi^i(t)\bigl[z^i(t) - \psi^i(t)^T\hat{x}^i(t)\bigr]. \tag{5.5.26}$$
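A minimal sketch of the recursion (5.5.25)-(5.5.26) for a two-dimensional parameter vector and a single processor is given below. Two choices are assumptions of the example rather than part of the text: the recursion is started at $t=2$ with $R$ equal to the identity (at $t=1$ the updated matrix would be the rank-one matrix $\psi\psi^T$ and could not be inverted), and the regressors are independent and uniformly distributed. The names `rls_step` and `run_rls` are hypothetical.

```python
import random


def rls_step(x, R, psi, z, t):
    """One step of (5.5.25)-(5.5.26) for 2-dimensional x and 2x2 R."""
    g = 1.0 / t
    # R(t+1) = R(t) + (1/t) * (psi psi^T - R(t))
    R_new = [[R[i][j] + g * (psi[i] * psi[j] - R[i][j]) for j in range(2)]
             for i in range(2)]
    det = R_new[0][0] * R_new[1][1] - R_new[0][1] * R_new[1][0]
    inv = [[R_new[1][1] / det, -R_new[0][1] / det],
           [-R_new[1][0] / det, R_new[0][0] / det]]
    err = z - (psi[0] * x[0] + psi[1] * x[1])
    direction = [inv[i][0] * psi[0] + inv[i][1] * psi[1] for i in range(2)]
    x_new = [x[i] + g * direction[i] * err for i in range(2)]
    return x_new, R_new


def run_rls(x_star, steps=5000, seed=2):
    rng = random.Random(seed)
    x = [0.0, 0.0]
    R = [[1.0, 0.0], [0.0, 1.0]]              # initial R (assumed)
    for t in range(2, steps + 2):              # start at t = 2 so R(t+1) stays invertible
        psi = [rng.uniform(-1.0, 1.0), rng.uniform(-1.0, 1.0)]
        z = psi[0] * x_star[0] + psi[1] * x_star[1] + rng.gauss(0.0, 0.1)
        x, R = rls_step(x, R, psi, z, t)
    return x, R
```

In this setting `run_rls([0.5, -0.3])` returns an estimate close to the true parameters, and `R` approaches the second-moment matrix of the regressor.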


This algorithm is not a pseudo-gradient algorithm, at least when the cost
function is $J(x) = \|x-x^*\|^2$, because $[R^i(t+1)]^{-1}\psi^i(t)\psi^i(t)^T$ is
not necessarily nonnegative definite, even though it is the product
of two non-negative definite matrices. It may be analyzed, however, via the ODE
approach as in Ljung [1977a]. Let us assume the following:

(i) Assumptions 5.2.1, 5.2.3, 5.2.4. (The time scale of the combining process
is faster than the natural time scale of the algorithm.)

(ii) All entries of $R^i$ and all components of $\hat{x}^i$ are combined using
the same weights, so that $\Phi^i(t)$ is actually a scalar for each $i$, $t$. Moreover,
$\Phi^i = \lim_{t\to\infty}\Phi^i(t)$ exists, for $i=1,2$.

(iii) For each $i$, the input $u^i(t)$ and the noise $w^i(t)$ are stationary
and independent stochastic processes. In particular, $w^i(t)$ is zero-mean,
white, with all moments finite. Also, $u^i(t)$ is the output of a
linear time invariant, asymptotically stable system, driven by some i.i.d.
noise $e^i(t)$ which has finite moments.

Let

$$f^i(x) = E[\psi^i(t)(y^i(t) - \psi^i(t)^T x)], \tag{5.5.27}$$

$$G^i = E[\psi^i(t)\psi^i(t)^T]. \tag{5.5.28}$$

(The expectations do not depend on $t$, because of stationarity.)


For the case of only one processor, the associated ODE was found by
Ljung to be:

$$\dot{x}(t) = R^{-1}(t)f(x(t)), \tag{5.5.29a}$$

$$\dot{R}(t) = G(x(t)) - R(t). \tag{5.5.29b}$$

It follows that the ODE associated to the decentralized algorithm with
two processors is:

$$\dot{x}(t) = R^{-1}(t)\bigl(\Phi^1 f^1(x(t)) + \Phi^2 f^2(x(t))\bigr), \tag{5.5.30a}$$

$$\dot{R}(t) = \Phi^1 G^1 + \Phi^2 G^2 - R(t). \tag{5.5.30b}$$

Let

$$\tilde{x}(t) = x^* - x(t), \tag{5.5.31}$$

where $x^*$ is the vector of true parameters. Then,

$$f^i(x) = G^i\tilde{x}, \tag{5.5.32}$$

so that (5.5.30) becomes

$$\dot{\tilde{x}}(t) = -R^{-1}(t)(\Phi^1 G^1 + \Phi^2 G^2)\tilde{x} = -R^{-1}(t)G\tilde{x}, \tag{5.5.33a}$$

$$\dot{R}(t) = (\Phi^1 G^1 + \Phi^2 G^2) - R(t) = G - R(t), \tag{5.5.33b}$$

where

$$G = \Phi^1 G^1 + \Phi^2 G^2. \tag{5.5.34}$$

Introducing the function

$$V(\tilde{x},R) = \tilde{x}^T R\tilde{x}, \tag{5.5.35}$$

we can easily check that this is a Lyapunov function which proves that
the ODE (5.5.33) is asymptotically stable, with domain of attraction
$\{(\tilde{x},R): R>0\}$ and converges to the point $(\tilde{x},R) = (0,G)$, provided that $G>0$.
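The Lyapunov property can be checked in one line. The computation below is a sketch that is not spelled out in the text; it assumes that $R(t)$ is symmetric and that the error $\tilde{x} = x^* - x$ and the matrix $R$ evolve according to (5.5.33), that is, $\dot{\tilde{x}} = -R^{-1}G\tilde{x}$ and $\dot{R} = G - R$.

```latex
\frac{d}{dt}V(\tilde{x},R)
  = 2\,\tilde{x}^T R\,\dot{\tilde{x}} + \tilde{x}^T \dot{R}\,\tilde{x}
  = -2\,\tilde{x}^T G\,\tilde{x} + \tilde{x}^T (G - R)\,\tilde{x}
  = -\,\tilde{x}^T G\,\tilde{x} - \tilde{x}^T R\,\tilde{x}.
```

With $R>0$ and $G\ge 0$ the right-hand side is strictly negative whenever $\tilde{x}\neq 0$, so $V$ decreases along trajectories that stay in the region $R>0$.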


The condition $G>0$ effectively requires that the input processes $u^1(t)$
and $u^2(t)$ are "sufficiently rich" so that all parameters may be identified.
Note that if we have $G^i>0$, for some $i$, then processor $i$ would be able
to identify all unknown parameters by itself. However, it is conceivable
that $G^1$ and $G^2$ are singular, but $G^1 + G^2 > 0$ (so that $\Phi^1 G^1 + \Phi^2 G^2 > 0$).
In that case, no processor may identify
the parameters by itself, but the two together can. The condition
(5.5.18) introduced in Theorem 5.5.1 is a similar "joint identifiability"
condition.
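A two-dimensional numerical instance of joint identifiability may help. In the sketch below the matrices `G1` and `G2` used in the usage note are made-up examples (each is singular), and the helper names are hypothetical; the smallest eigenvalue of a symmetric 2x2 matrix is computed in closed form.

```python
import math


def min_eig_sym2(m):
    """Smallest eigenvalue of a symmetric 2x2 matrix [[a, b], [b, d]]."""
    a, b, d = m[0][0], m[0][1], m[1][1]
    return 0.5 * (a + d) - math.sqrt((0.5 * (a - d)) ** 2 + b * b)


def weighted_sum(g1, g2, phi1, phi2):
    """G = phi1 * G1 + phi2 * G2, as in (5.5.34), with scalar weights."""
    return [[phi1 * g1[i][j] + phi2 * g2[i][j] for j in range(2)]
            for i in range(2)]
```

For `G1 = [[1, 0], [0, 0]]` and `G2 = [[0, 0], [0, 1]]`, each matrix has smallest eigenvalue zero, while `weighted_sum(G1, G2, 0.5, 0.5)` is positive definite.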

We have effectively shown that the distributed RLS algorithm converges
appropriately under certain assumptions, including the assumption
$G>0$. However, this result is valid under the assumption that the algorithm
returns an infinite number of times to a bounded region. This latter
assumption has to be verified using different means, but we conjecture
that it holds.

A further issue which suggests itself is the problem of choosing
$\Phi^1$, $\Phi^2$ so as to maximize the speed of convergence of the algorithm.
Given that the ODE (5.5.33) is an approximate description of the
asymptotic behavior of the algorithm, the question becomes: given $G^1$,
$G^2$, what are the choices of $\Phi^1$, $\Phi^2$ which maximize the rate of convergence
of the ODE (5.5.33)?

Example  2:  Two  Processes  Driven  by  a  Common  Colored  Noise. 

Consider two stochastic processes $z^1(t)$, $z^2(t)$ which are generated

Here $A^i(q)$, $B^i(q)$ are polynomials, $q$ is the unit delay operator, $u^i(t)$
is an input stochastic process and $w(t)$ is a common noise process. We
assume that $w(t)$ is not white and is generated according to


$$w(t) = C(q)v(t), \tag{5.5.37}$$

where $C$ is a monic polynomial and $v(t)$ is a white process. (See Figure
5.5.3.) Without loss of generality, we assume that the polynomials
$A^i(\cdot)$, $B^i(\cdot)$, $C(\cdot)$ all have the same degree, which is denoted by $k$.

In the single processor case, it is well-known that LMS or RLS
algorithms lead to biased estimates. Consistent estimates may be obtained,
however, if a processor also tries to identify the parameters of the
polynomial $C(q)$. One such algorithm, the Extended Least Squares
algorithm, has been shown to be globally convergent and a returning condition
is not required [Solo, 1979]. (Some confusion may arise due to the
fact that different authors use different names to denote the same
algorithm. For example, what we call ELS is called AML by Solo.)

For the example in Figure 5.5.3, either processor could try to
identify $A^i(q)$, $B^i(q)$, $C(q)$ by itself. However, since they both
identify the common noise source $C(q)$ they might benefit by exchanging
and combining the estimates of the coefficients of $C(q)$. We then
obtain the following algorithm (this is the stochastic approximation
version of ELS):


Let

$$\theta_0^* = (c_1,\dots,c_k), \tag{5.5.38}$$

$$\theta_i^* = (a_1^i,\dots,a_k^i,\ b_1^i,\dots,b_k^i), \quad i=1,2 \tag{5.5.39}$$

be the vectors of true parameters to be identified. Note that processor
$i$ is interested in identifying just the vector $(\theta_i^*,\theta_0^*)$. Let
$\hat\theta^i(n) = (\hat\theta_i(n),\hat\theta_0^i(n))$ be the estimate of processor $i$ at time $n$. In
the absence of any communications, the computations performed by
processor $i$, at time $n$, are the following:


Processor $i$ has already computed

$$\psi^i(n) = (-z^i(n-1),\dots,-z^i(n-k),\ u^i(n),\dots,u^i(n-k),\ \eta^i(n-1),\dots,\eta^i(n-k)). \tag{5.5.40}$$

He updates $\hat\theta^i$ by

$$\hat\theta^i(n) = \hat\theta^i(n-1) + \frac{1}{n}\,\psi^i(n)\bigl(y^i(n) - \psi^{iT}(n)\hat\theta^i(n-1)\bigr). \tag{5.5.41}$$

He computes the residual

$$\eta^i(n) = y^i(n) - \psi^{iT}(n)\hat\theta^i(n) \tag{5.5.42}$$

and uses it to compute $\psi^i(n+1)$.

In the absence of communications, we obtain a different ODE for each
processor. Namely:

$$\dot\theta_1(t) = f_1(\theta_1(t),\theta_0(t)), \qquad \dot\theta_0(t) = f_{10}(\theta_1(t),\theta_0(t)) \tag{5.5.43}$$

and similarly for processor 2. In the presence of communications of the
values of $\hat\theta_0$, let us assume that the Assumptions of Theorem 5.4.2 are
satisfied and that $\Phi^i(t) = \Phi^i$ (independent of time). We then obtain
a joint ODE, which is:

$$\dot\theta_1(t) = f_1(\theta_1(t),\theta_0(t)),$$

$$\dot\theta_2(t) = f_2(\theta_2(t),\theta_0(t)), \tag{5.5.44}$$

$$\dot\theta_0(t) = \Phi^1 f_{10}(\theta_1(t),\theta_0(t)) + \Phi^2 f_{20}(\theta_2(t),\theta_0(t)).$$


Ljung [1977b] has shown that the ODE (5.5.43) associated with the centralized
ELS algorithm is asymptotically stable (and all required conditions are
satisfied) provided that the positive real condition

$$\operatorname{Re} C(e^{i\omega}) > 0, \quad \forall \omega \in [0, 2\pi], \tag{5.5.45}$$

is satisfied. Moreover, stability may be demonstrated using the Lyapunov
function

$$V_i(\theta_i, \theta_0) = \|\theta_i - \theta_i^*\|^2 + \|\theta_0 - \theta_0^*\|^2. \tag{5.5.46}$$
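The positive real condition (5.5.45) can be checked numerically for a given noise polynomial. The sketch below assumes that $C(q) = 1 + c_1 q^{-1} + \cdots + c_k q^{-k}$ and evaluates the real part on a uniform frequency grid; the function name `is_positive_real` and the grid size are illustrative choices, and a grid test is only a heuristic check, not a proof.

```python
import cmath
import math


def is_positive_real(c, grid=2000):
    """Check Re C(e^{i w}) > 0 on a uniform grid of [0, 2*pi].

    Assumes C(q) = 1 + c[0] q^-1 + ... + c[k-1] q^-k (an assumed convention).
    """
    for m in range(grid + 1):
        w = 2.0 * math.pi * m / grid
        value = 1.0 + sum(cj * cmath.exp(-1j * w * (j + 1)) for j, cj in enumerate(c))
        if value.real <= 0.0:
            return False
    return True


mild_noise = is_positive_real([0.5])    # Re C = 1 + 0.5 cos(w), always positive
strong_noise = is_positive_real([1.5])  # Re C = 1 + 1.5 cos(w), negative near w = pi
```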


Even though (5.5.43) is stable, for $i = 1, 2$, it does not seem to follow that
(5.5.44) is stable as well. For this reason, we introduce a small
modification. Namely, we assume that processor $i$ updates the first
$2k$ components of $\hat\theta^i$ using

$$\hat\theta^i(n) = \hat\theta^i(n-1) + \frac{\Phi^i}{n}\, \psi^i(n) \big(y^i(n) - \psi^{iT}(n)\, \hat\theta^i(n-1)\big), \tag{5.5.46}$$

instead of (5.5.41). The remaining $k$ components (those corresponding
to $C(q)$) are updated according to (5.5.41). Then, the ODE associated
with the decentralized algorithm becomes:

$$\begin{aligned}
\dot\theta_1(t) &= \Phi^1 f_1(\theta_1(t), \theta_0(t)) \\
\dot\theta_2(t) &= \Phi^2 f_2(\theta_2(t), \theta_0(t)) \\
\dot\theta_0(t) &= \Phi^1 f_{10}(\theta_1(t), \theta_0(t)) + \Phi^2 f_{20}(\theta_2(t), \theta_0(t))
\end{aligned} \tag{5.5.47}$$


This  is  a  convex  combination  of  two  stable  ODE's  with  a  common  Lyapunov 


The results of Section 5.3 on deterministic pseudo-gradient algorithms
(Theorem 5.3.1) show that, given a bound on the time between consecutive
communications and on the communication delay, we can find a small
enough step-size so that the algorithm is convergent. Moreover, by
tracing the steps in the proof of Theorem 5.3.1, it is possible to
actually evaluate a bound $\gamma^*$ on the step-size that ensures convergence.

We may also follow a reverse approach: given a step-size, we may
evaluate a bound on the time between consecutive communications and
communication delays, so that the algorithm converges. However, the
bounds obtained in this way are not necessarily tight, much of the structure
of the problem may be lost and, finally, they may not be particularly
illuminating.

In  this  section  we  impose  more  structure  on  the  nature  of  the 
optimization  problem  and  the  algorithm  under  study,  so  as  to  obtain  more 
specific  results.  The  conceptual  motivation  behind  this  new  approach  is 
based  on  the  following  statement  which  seems  reasonable,  at  least  on  an 
intuitive  basis,  from  a  normative  perspective: 


If an optimization problem consists of subproblems, each
subproblem being assigned to a different agent (processor),
then the frequency of communications between a pair of processors
should reflect the degree by which their respective subproblems
are coupled together.

The  above  statement  is  fairly  hard  to  capture  mathematically.  We 
believe  that  this  is  accomplished,  at  least  to  some  degree,  by  the  model 
and  the  results  of  this  section.  An  extensive  conceptual  discussion  of 

these  results  may  be  found  in  Section  5.7. 

Let $J: \mathbb{R}^M \to [0, \infty)$ be a cost function to be minimized, which has a special
structure:

$$J(x) = J(x_1, \ldots, x_M) = \sum_{i=1}^{M} J^i(x_1, \ldots, x_M), \tag{5.6.1}$$

where $J^i: \mathbb{R}^M \to [0, \infty)$. So far, equation (5.6.1) does not impose any restriction
on $J$; we will be interested, however, in the case where, for each $i$,
$J^i$ depends on $x_i$ and only a few more components; consequently, the
Hessian matrix of each $J^i$ is sparse.

We view $J^i$ as a cost directly faced by processor $i$. This processor
is free to fix or update the component $x_i$, but its cost also depends on
a few interaction variables (other components of $x$) which are under the
authority of other processors.

We may visualize the structure of the interactions by means of a
directed graph $G = (V, E)$:

(i) The set $V$ of nodes of $G$ is $V = \{1, \ldots, M\}$.

(ii) The set $E$ of edges of the graph is

$$E = \{(i, j) : J^j \text{ depends on } x_i\}. \tag{5.6.2}$$

For example, the graph in Figure 5.6.1 corresponds to a cost function
of the form $J^1(x_1) + J^2(x_1, x_2, x_3) + J^3(x_2, x_3)$.
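As a small illustration of definition (5.6.2), the edge set can be built directly from a table listing which variables each cost term depends on. The dictionary `depends_on` and the function name `interaction_edges` are hypothetical names introduced here; the data encode the example cost $J^1(x_1) + J^2(x_1, x_2, x_3) + J^3(x_2, x_3)$.

```python
def interaction_edges(depends_on):
    """Return E = {(i, j): J^j depends on x_i}, following (5.6.2)."""
    return {(i, j) for j, variables in depends_on.items() for i in variables}


# depends_on[j] lists the indices i such that J^j depends on x_i.
depends_on = {1: {1}, 2: {1, 2, 3}, 3: {2, 3}}
edges = interaction_edges(depends_on)
```

With this convention every node carries a self-loop $(i, i)$, because $J^i$ depends on $x_i$.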

Since we are interested in the fine structure of the problem, we
quantify the interactions between subproblems by assuming that the
following bounds are available:

$$\left|\frac{\partial^2 J}{\partial x_i \partial x_j}(x)\right| \le K_{ij}, \qquad \left|\frac{\partial^2 J^k}{\partial x_i \partial x_j}(x)\right| \le K^k_{ij}, \qquad \forall x \in \mathbb{R}^M, \tag{5.6.3}$$

and note that $K_{ij}$ can be chosen so that

$$K_{ij} \le \sum_{k=1}^{M} K^k_{ij}. \tag{5.6.4}$$

A centralized gradient-type algorithm for minimizing $J$ is the
following:

$$\lambda_i^j(n) = \frac{\partial J^j}{\partial x_i}(x(n)), \tag{5.6.5}$$

$$x_i(n+1) = x_i(n) - \gamma_i \sum_{j=1}^{M} \lambda_i^j(n). \tag{5.6.6}$$


The summation in (5.6.6) needs to be carried out only for those $j$'s
such that $(i, j) \in E$, because otherwise $\lambda_i^j(n)$ is zero. This has to
be kept in mind when considering implementations of the algorithm. The
above algorithm may be implemented in a synchronous decentralized way
as follows:

Algorithm 1:

1. For each $(i, j) \in E$, processor $i$ evaluates $\lambda_i^j(n)$ according
to (5.6.5).

2. Each processor $i$ uses (5.6.6) to update $x_i$.

3. For each $i, j, k$ such that $(i, j) \in E$, $(k, j) \in E$, processor
$k$ transmits $x_k(n+1)$ to processor $i$.

It is more interesting, however, to assume that the functional
form of $J^j$ is stored only in the memory of processor $j$ and no other
processor "knows" $J^j$, i.e. the input is itself decentralized. (Such a
configuration has lower memory requirements.) In such a case, it is
more meaningful to let the first step in Algorithm 1 be executed by
processor $j$ (instead of processor $i$) and then have processor $j$ transmit
the result to processor $i$. We then obtain the following:

Algorithm 2:

1. For each $(i, j) \in E$, processor $j$ evaluates $\lambda_i^j(n)$ according
to (5.6.5).

2. For each $(i, j) \in E$, processor $j$ transmits $\lambda_i^j(n)$ to processor $i$.

3. Each processor $i$ uses (5.6.6) to update $x_i$.

4. For each $(i, j) \in E$, processor $i$ transmits $x_i(n+1)$ to
processor $j$.
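The four steps of Algorithm 2 can be simulated on a toy problem and compared with the centralized iteration (5.6.5)-(5.6.6). The sketch below is an illustration under assumptions that are not in the text: the cost terms are taken to be $J^j(x) = \frac{1}{2}(a_j^T x - b_j)^2$ with the sparsity pattern of the Figure 5.6.1 example, indices are 0-based, and the names `partial`, `centralized_step` and `algorithm2_step` are hypothetical.

```python
# Hypothetical cost terms J^j(x) = 0.5 * (A[j] . x - B[j])**2.
A = [[1.0, 0.0, 0.0], [0.5, 1.0, 0.5], [0.0, 0.5, 1.0]]
B = [1.0, 0.0, 1.0]
M = 3
# (i, j) is an edge when J^j depends on x_i, as in (5.6.2).
E = [(i, j) for j in range(M) for i in range(M) if A[j][i] != 0.0]


def partial(j, i, x):
    """Partial derivative of J^j with respect to x_i at x, as in (5.6.5)."""
    residual = sum(A[j][m] * x[m] for m in range(M)) - B[j]
    return A[j][i] * residual


def centralized_step(x, gamma):
    """One iteration of (5.6.6) with full knowledge of x."""
    return [x[i] - gamma[i] * sum(partial(j, i, x) for j in range(M)) for i in range(M)]


def algorithm2_step(local, gamma):
    """One synchronous round of Algorithm 2; local[j] is processor j's copy of x."""
    # Steps 1-2: processor j evaluates lambda_i^j and sends it to processor i.
    inbox = [[] for _ in range(M)]
    for i, j in E:
        inbox[i].append(partial(j, i, local[j]))
    # Step 3: each processor i updates its own component.
    updated = [local[i][i] - gamma[i] * sum(inbox[i]) for i in range(M)]
    # Step 4: processor i sends x_i(n+1) to every processor j with (i, j) in E.
    new_local = [list(copy) for copy in local]
    for i, j in E:
        new_local[j][i] = updated[i]
    return new_local


gamma = [0.3, 0.3, 0.3]
x = [0.0, 0.0, 0.0]
local = [[0.0, 0.0, 0.0] for _ in range(M)]
for _ in range(400):
    x = centralized_step(x, gamma)
    local = algorithm2_step(local, gamma)
z = [local[i][i] for i in range(M)]  # components held by their "owners"
```

In this example each processor only ever refreshes the components its own cost term depends on, yet the owned components `z` track the centralized iterate `x`.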

Algorithm 2 is mathematically equivalent to the centralized algorithm
(5.6.5)-(5.6.6) but has certain drawbacks: a) there are strict
synchronization requirements; b) if there is some pair $(i, j)$ such that
communications from $i$ to $j$ are particularly slow, the operation of all
processors has to be slowed down. For these reasons we are interested in
an asynchronous version of Algorithm 2 which tolerates communication
delays. In fact, the algorithm we present below may also allow us to reduce
the number of transmitted messages per stage. The specific advantages
are discussed in detail in Section 5.7.

We let, as usual, $x^i(n) = (x_1^i(n), \ldots, x_M^i(n))$ denote the $x$-vector
stored in the memory of processor $i$ at time $n$. We also assume that each
processor $i$ stores in its memory another vector $(\lambda_i^1(n), \ldots, \lambda_i^M(n))$ with
its estimates of $\partial J^1/\partial x_i, \ldots, \partial J^M/\partial x_i$. We do not require that a message be
transmitted at each time stage and we allow communication delays. So,
let

$p^{ki}(n)$ = the time that a message with a value of $x_k$ was sent
from processor $k$ to processor $i$, and this was the last such
message received no later than time $n$.

$q^{ki}(n)$ = the time that a message with a value of $\lambda_i^k$ was sent
from processor $k$ to processor $i$, and this was the last
such message received no later than time $n$.

For consistency of our notation we let

$$p^{ii}(n) = q^{ii}(n) = n, \quad \forall i, \; \forall n. \tag{5.6.7}$$

With the above definitions, we have:

$$x_k^i(n) = x_k^k(p^{ki}(n)), \quad \forall n, \; \forall (k, i) \in E, \tag{5.6.8}$$

$$\lambda_i^k(n) = \frac{\partial J^k}{\partial x_i}\big(x^k(q^{ki}(n))\big), \quad \forall n, \; \forall (i, k) \in E. \tag{5.6.9}$$

Equations (5.6.8), (5.6.9) together with

$$x_i^i(n+1) = x_i^i(n) - \gamma_i \sum_{k=1}^{M} \lambda_i^k(n) \tag{5.6.10}$$

specify completely the asynchronous decentralized algorithm to be
studied.
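The asynchronous iteration can be simulated by keeping the history of each processor's own component and reading stale values from it. The sketch below is illustrative only. It assumes quadratic cost terms $J^j(x) = \frac{1}{2}(a_j^T x - b_j)^2$, 0-based indices, a simple periodic message schedule with a fixed delay, and initial values known to every processor at time 0. The names `last_sent` and `run_async` and the schedule itself are assumptions introduced for the example.

```python
# Hypothetical cost terms J^j(x) = 0.5 * (A[j] . x - B[j])**2.
A = [[1.0, 0.0, 0.0], [0.5, 1.0, 0.5], [0.0, 0.5, 1.0]]
B = [1.0, 0.0, 1.0]
M = 3
E = {(i, j) for j in range(M) for i in range(M) if A[j][i] != 0.0}


def partial(j, i, x):
    """Partial derivative of J^j with respect to x_i at the point x."""
    residual = sum(A[j][m] * x[m] for m in range(M)) - B[j]
    return A[j][i] * residual


def last_sent(n, sender, receiver, period, delay):
    """Send time of the latest message from sender held by receiver at time n.

    Messages leave at times 0, period, 2*period, ... and arrive `delay` steps
    later, so n - last_sent(...) never exceeds delay + period - 1.
    """
    if sender == receiver:
        return n  # a processor always knows its own current values
    return max(0, ((n - delay) // period) * period)


def run_async(gamma, steps, period, delay):
    own = [[0.0] for _ in range(M)]  # own[i][n] is x_i as held by processor i at time n

    def local_copy(k, t):
        # Processor k's (possibly outdated) view of the whole vector at time t.
        return [own[j][last_sent(t, j, k, period, delay)] for j in range(M)]

    for n in range(steps):
        new_values = []
        for i in range(M):
            s_i = 0.0
            for k in range(M):
                if (i, k) in E:
                    t = last_sent(n, k, i, period, delay)
                    # Derivative evaluated by processor k at its own stale view.
                    s_i += partial(k, i, local_copy(k, t))
            new_values.append(own[i][n] - gamma[i] * s_i)
        for i in range(M):
            own[i].append(new_values[i])
    return [own[i][-1] for i in range(M)]


z = run_async(gamma=[0.05, 0.05, 0.05], steps=3000, period=2, delay=1)
```

With `period=2` and `delay=1` no received message is older than 2 time units, and the small step-size keeps the iteration convergent to the minimizer of the toy cost despite the outdated information.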

As in Theorem 5.3.1 (for the specialization case), we assume that
the time between consecutive communications and the communication delays
are bounded. However, we allow the bounds to be different for each pair
of processors and type of message:

Assumption 5.6.1: For some constants $P^{ik}$, $Q^{ik}$,

$$n - P^{ik} \le p^{ik}(n) \le n, \quad \forall (i, k) \in E, \; \forall n, \tag{5.6.11}$$

$$n - Q^{ik} \le q^{ik}(n) \le n, \quad \forall (k, i) \in E, \; \forall n. \tag{5.6.12}$$

Note that we may let $P^{ik} = Q^{ik} = 0$, to recover a synchronous algorithm.

The result that follows states that the algorithm converges, if the
time between consecutive communications plus communication delays
between any two processors is not too large compared to the degree of
coupling of their subproblems.


Theorem 5.6.1: Suppose that for each $i$

$$\frac{1}{\gamma_i} > \sum_{j=1}^{M} K_{ij} + \sum_{k=1}^{M} \sum_{j=1}^{M} K^k_{ij} \big(P^{ik} + Q^{kj} + P^{jk} + Q^{ki}\big). \tag{5.6.13}$$

Let $z(n) = (x_1^1(n), x_2^2(n), \ldots, x_M^M(n))$. Then

$$\lim_{n \to \infty} \frac{\partial J}{\partial x_i}(z(n)) = 0, \quad \forall i. \tag{5.6.14}$$
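To see how the coupling bounds and the communication bounds trade off against the step-sizes, one can evaluate the largest step-size allowed for each processor. The sketch below assumes that condition (5.6.13) reads $1/\gamma_i > \sum_j K_{ij} + \sum_k \sum_j K^k_{ij}(P^{ik} + Q^{kj} + P^{jk} + Q^{ki})$ and, purely for illustration, uses quadratic cost terms $J^k(x) = \frac{1}{2}(a_k^T x - b_k)^2$, for which $K^k_{ij} = |a_{ki} a_{kj}|$. The function name `stepsize_bounds` and the matrix `A` are hypothetical.

```python
def stepsize_bounds(K, Kk, P, Q):
    """Strict upper bounds on gamma_i implied by the assumed form of (5.6.13).

    K[i][j] bounds the Hessian of J, Kk[k][i][j] bounds the Hessian of J^k,
    P[a][b] and Q[a][b] bound the age of messages sent from a to b.
    """
    M = len(K)
    bounds = []
    for i in range(M):
        total = sum(K[i])
        for k in range(M):
            for j in range(M):
                total += Kk[k][i][j] * (P[i][k] + Q[k][j] + P[j][k] + Q[k][i])
        bounds.append(float("inf") if total == 0 else 1.0 / total)
    return bounds


# Hypothetical quadratic example: row k of A defines J^k.
A = [[1.0, 0.0, 0.0], [0.5, 1.0, 0.5], [0.0, 0.5, 1.0]]
M = 3
Kk = [[[abs(A[k][i] * A[k][j]) for j in range(M)] for i in range(M)] for k in range(M)]
K = [[abs(sum(A[k][i] * A[k][j] for k in range(M))) for j in range(M)] for i in range(M)]

zero = [[0] * M for _ in range(M)]
two = [[0 if a == b else 2 for b in range(M)] for a in range(M)]

sync_bounds = stepsize_bounds(K, Kk, zero, zero)   # synchronous case
async_bounds = stepsize_bounds(K, Kk, two, two)    # messages at most 2 steps old
```

In this example allowing messages to be up to two steps old shrinks the admissible step-sizes by roughly a factor of four, which illustrates the price of infrequent communication under this sufficient condition.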


Proof: Let $s_i(n) = \sum_{k=1}^{M} \lambda_i^k(n)$ and note that

$$x_i^i(n+1) = x_i^i(n) - \gamma_i s_i(n). \tag{5.6.15}$$


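From a Taylor series expansion for $J$ we obtain:

$$\begin{aligned}
J(z(n+1)) &\le J(z(n)) - \sum_{i=1}^{M} \gamma_i \frac{\partial J}{\partial x_i}(z(n))\, s_i(n) + \frac{1}{2} \sum_{i=1}^{M} \gamma_i^2 \sum_{j=1}^{M} K_{ij}\, |s_i(n)|^2 \\
&\le J(z(n)) - \sum_{i=1}^{M} \gamma_i |s_i(n)|^2 + \sum_{i=1}^{M} \gamma_i \left|\frac{\partial J}{\partial x_i}(z(n)) - s_i(n)\right| \cdot |s_i(n)| + \frac{1}{2} \sum_{i=1}^{M} \gamma_i^2 \sum_{j=1}^{M} K_{ij}\, |s_i(n)|^2.
\end{aligned} \tag{5.6.16}$$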


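Using (5.6.9),

$$\begin{aligned}
\left|\frac{\partial J}{\partial x_i}(z(n)) - s_i(n)\right| &= \left|\sum_{k=1}^{M} \left[\frac{\partial J^k}{\partial x_i}(z(n)) - \lambda_i^k(n)\right]\right| \\
&\le \sum_{k=1}^{M} \left|\frac{\partial J^k}{\partial x_i}(z(n)) - \frac{\partial J^k}{\partial x_i}\big(x^k(q^{ki}(n))\big)\right| \\
&\le \sum_{k=1}^{M} \sum_{j=1}^{M} K^k_{ij} \left|x_j^j(n) - x_j^k(q^{ki}(n))\right|.
\end{aligned} \tag{5.6.17}$$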


Now note that

$$\left|x_j^j(n) - x_j^k(q^{ki}(n))\right| = \left|x_j^j(n) - x_j^j\big(p^{jk}(q^{ki}(n))\big)\right| \le \sum_{m=n-Q^{ki}-P^{jk}}^{n-1} \gamma_j |s_j(m)|. \tag{5.6.18}$$


Hence,

$$\begin{aligned}
J(z(n+1)) \le{}& J(z(n)) - \sum_{i=1}^{M} \gamma_i |s_i(n)|^2 + \frac{1}{2} \sum_{i=1}^{M} \gamma_i^2 \Big(\sum_{j=1}^{M} K_{ij}\Big) |s_i(n)|^2 \\
&+ \sum_{i=1}^{M} \gamma_i \sum_{k=1}^{M} \sum_{j=1}^{M} K^k_{ij} \sum_{m=n-Q^{ki}-P^{jk}}^{n-1} \gamma_j |s_j(m)| \cdot |s_i(n)|.
\end{aligned} \tag{5.6.19}$$

Using the inequality

$$\gamma_i \gamma_j |s_i(n)| \, |s_j(m)| \le \frac{1}{2} \big(\gamma_i^2 |s_i(n)|^2 + \gamma_j^2 |s_j(m)|^2\big), \tag{5.6.20}$$


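and summing (5.6.19) for different values of $n$, we obtain

$$\begin{aligned}
J(z(n+1)) \le{}& J(z(0)) - \sum_{\ell=0}^{n} \sum_{i=1}^{M} \gamma_i |s_i(\ell)|^2 + \frac{1}{2} \sum_{\ell=0}^{n} \sum_{i=1}^{M} \gamma_i^2 \Big(\sum_{j=1}^{M} K_{ij}\Big) |s_i(\ell)|^2 \\
&+ \frac{1}{2} \sum_{\ell=0}^{n} \sum_{i=1}^{M} \sum_{k=1}^{M} \sum_{j=1}^{M} K^k_{ij} \sum_{m=\ell-Q^{ki}-P^{jk}}^{\ell-1} \big(\gamma_i^2 |s_i(\ell)|^2 + \gamma_j^2 |s_j(m)|^2\big).
\end{aligned} \tag{5.6.21}$$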


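Note that

$$\begin{aligned}
\sum_{\ell=0}^{n} \sum_{m=\ell-Q^{ki}-P^{jk}}^{\ell-1} \big(\gamma_i^2 |s_i(\ell)|^2 + \gamma_j^2 |s_j(m)|^2\big)
&\le (P^{jk} + Q^{ki}) \sum_{\ell=0}^{n} \gamma_i^2 |s_i(\ell)|^2 + \sum_{m=0}^{n-1} \sum_{\ell=m+1}^{m+Q^{ki}+P^{jk}} \gamma_j^2 |s_j(m)|^2 \\
&\le (P^{jk} + Q^{ki}) \sum_{\ell=0}^{n} \big[\gamma_i^2 |s_i(\ell)|^2 + \gamma_j^2 |s_j(\ell)|^2\big].
\end{aligned} \tag{5.6.22}$$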


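We now use (5.6.22), taking $K^k_{ij} = K^k_{ji}$ (which is no loss of generality, since Hessian matrices are symmetric), to bound the last iterated sum in (5.6.21):

$$\frac{1}{2} \sum_{\ell=0}^{n} \sum_{i=1}^{M} \sum_{k=1}^{M} \sum_{j=1}^{M} K^k_{ij} \sum_{m=\ell-Q^{ki}-P^{jk}}^{\ell-1} \big(\gamma_i^2 |s_i(\ell)|^2 + \gamma_j^2 |s_j(m)|^2\big) \le \frac{1}{2} \sum_{\ell=0}^{n} \sum_{i=1}^{M} \gamma_i^2 |s_i(\ell)|^2 \sum_{k=1}^{M} \sum_{j=1}^{M} K^k_{ij} \big(P^{jk} + Q^{ki} + P^{ik} + Q^{kj}\big). \tag{5.6.23}$$

Using (5.6.23) in (5.6.21) and collecting terms, we have

$$J(z(n+1)) \le J(z(0)) + \sum_{\ell=0}^{n} \sum_{i=1}^{M} |s_i(\ell)|^2 \left[-\gamma_i + \frac{1}{2} \gamma_i^2 \sum_{j=1}^{M} K_{ij} + \frac{1}{2} \gamma_i^2 \sum_{k=1}^{M} \sum_{j=1}^{M} K^k_{ij} \big(P^{jk} + Q^{ki} + P^{ik} + Q^{kj}\big)\right]. \tag{5.6.24}$$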

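We now use (5.6.13), which makes each bracketed coefficient in (5.6.24) negative, together with the fact that $J$ is nonnegative, and conclude that, for some constant $A$,

$$\sum_{\ell=0}^{n} |s_i(\ell)|^2 \le A\, J(z(0)), \quad \forall i, n. \tag{5.6.25}$$

In particular, $s_i(n)$ converges to zero for each $i$. We may finally use inequalities (5.6.17) and (5.6.18) to obtain
(5.6.14). ∎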
We  close  this  section  with  a  few  remarks: 

1.  The  bounds  provided  by  (5.6.13)  are  sufficient  for  convergence 
but  they  are  not  tight,  nor  necessary.  In  particular,  the  first 
inequality  in  (5.6.16)  is  not  tight,  even  if  J  is  a  quadratic  function, 
because  it  does  not  take  into  account  the  signs  of  the  entries  in  the 
Hessian  matrix  of  J.  It  is  also  conceivable  that  inequality  (5.6.20) 
could  be  improved  by  taking  into  account  the  fine  structure  of  the 
problem. 

It is known [Bertsekas, 1983] that a decentralized algorithm of a
similar type may converge in certain special cases, even if the
$\gamma_i$'s are held fixed while the bounds $P^{ik}$, $Q^{ik}$ are allowed to be
arbitrarily large. So, the gap between the sufficient conditions
(5.6.13) and the necessary conditions may be substantial. Further
research might narrow this gap.

2. The convergence rate of the decentralized algorithm should be
expected to deteriorate as the bounds $P^{ik}$, $Q^{ik}$ increase. A characterization
of the convergence rate, however, seems to be a fairly hard problem.


5.7  TOWARDS  ORGANIZATIONAL  DESIGN 


We have argued earlier that a central theme in the design of
decentralized systems is the question "who should communicate to whom,
what and how often". In this section we discuss our results (mainly
those of Section 5.6) in the context of the above question.

It has often been suggested that the behavior of a boundedly
rational human decision maker (or economic agent) may be modeled as a
descent-type iterative algorithm. Proceeding along the same lines, we
may view distributed descent (e.g. pseudo-gradient) algorithms as a
model of adjustment or as a behavioral model in a human organization.
This coincides with the models of Arrow and Hurwicz [1960] for adjustment
in economic markets and of Meerkov [1979] for societies of animals.

Other algorithms from mathematical programming (e.g. the Dantzig-Wolfe
decomposition) have been proposed as models of the budget-allocation
process in a divisionalized company.

Making the assumption that an iterative algorithm describes the
behavior of the members of an organization, let us now place ourselves
in the position of the designer of an organization. His task is to
create a chart prescribing the flow of information between the
decision makers. Let us assume that the objective of the organization
is the minimization of a certain organizational cost which is the sum
of the costs faced by each division. To each division, there corresponds
a decision maker who is knowledgeable enough about the structure of the
problem he is facing that, given a tentative decision
(or operating point), he is able to change his decision in a direction
which results in improvement. We also assume that the divisions are
interacting in some way, so that the decisions of one decision maker
may affect the costs of another division. It follows that a decision
maker who is interested in the well-being of the entire
organization needs to take into account two types of information, coming
from other divisions:


(i) "What everybody else is doing". (Because their decisions
affect what is a "good" decision for my subproblem.)

(ii) "How do my decisions influence them". (That is, I need to
take into account the side-effects of my decisions on other divisions.)


Moreover, since the organization is assumed to undergo a process of
change and adjustment, this information has to be updated, from time
to time, so as to keep up with the changes that have occurred. However,
this updating occurs in a fairly asynchronous manner, because it is
unreasonable to assume that decision makers in an organization make
changes in a synchronized manner and that they inform each other each
time they make a change.

It should be clear that the above described setting closely resembles
the algorithm studied in Section 5.6. In particular, messages carrying the
component $x_i$ from processor $i$ carry the information of "what
everybody else is doing" and messages $\lambda_j^i$ carry the information of
how processor $j$ influences the $i$-th division.


Having established this relationship between the organizational
problem and the mathematical description, we proceed to the main
question, "who should communicate to whom, what, how often". We address
each component of this question separately:

(i) Who should communicate to whom? Decision
makers should communicate if and only if their divisions interact in
some way. Mathematically, whether two divisions interact or not may
be seen from the graph $G$ that was constructed in Section 5.6. Messages
should flow along the edges of that graph.

(ii) What should be communicated? This is also an easy question
(at least in our framework). If $(i, j) \in E$ (that is, $x_i$ influences $J^j$),
decision maker $i$ should inform $j$ of his decisions and $j$ should "reply"
to $i$ how he is affected by these decisions.


(iii) How often should they communicate? It is intuitively obvious
that the frequency of communications between two decision makers should
be proportional to the degree of coupling between their subproblems.

While this statement is, in general, hard to make mathematically precise,
it may be effectively studied in the framework that we have introduced:
the "degree of coupling between divisions" is quantified in terms of
the bounds $K^k_{ij}$ on the second derivatives of the costs; the frequency
of communications is also quantified, in terms of the bounds $P^{ij}$, $Q^{ij}$.
Finally, these are coupled together via the sufficient conditions of
Theorem 5.6.1, which prescribe the frequency of communications in terms
of interconnection strengths and the speed of adjustment.
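As an illustration of how the "how often" question could be answered numerically, the sketch below searches for the largest common bound $D$ on message age that keeps the sufficient condition satisfied for given step-sizes. It assumes that condition (5.6.13) has the form $1/\gamma_i > \sum_j K_{ij} + \sum_k \sum_j K^k_{ij}(P^{ik} + Q^{kj} + P^{jk} + Q^{ki})$, that a single bound $D$ is used for every pair of distinct processors, and that the costs are the hypothetical quadratics $J^k(x) = \frac{1}{2}(a_k^T x - b_k)^2$. The function name `max_uniform_bound` is introduced only for this example.

```python
def max_uniform_bound(gamma, K, Kk, limit=10**6):
    """Largest integer D for which the assumed form of (5.6.13) holds when
    P and Q equal D for distinct processors and 0 on the diagonal.

    Returns None if the condition already fails for D = 0, and `limit`
    if no D up to `limit` violates it (for instance, with no coupling).
    """
    M = len(K)

    def admissible(D):
        for i in range(M):
            total = sum(K[i])
            for k in range(M):
                for j in range(M):
                    # How many of the four bounds in (5.6.13) link distinct processors.
                    count = 2 * (i != k) + 2 * (j != k)
                    total += Kk[k][i][j] * D * count
            if not 1.0 / gamma[i] > total:
                return False
        return True

    if not admissible(0):
        return None
    for D in range(limit):
        if not admissible(D + 1):
            return D
    return limit


# Hypothetical quadratic example: row k of A defines J^k.
A = [[1.0, 0.0, 0.0], [0.5, 1.0, 0.5], [0.0, 0.5, 1.0]]
M = 3
Kk = [[[abs(A[k][i] * A[k][j]) for j in range(M)] for i in range(M)] for k in range(M)]
K = [[abs(sum(A[k][i] * A[k][j] for k in range(M))) for j in range(M)] for i in range(M)]

slow_adjustment = max_uniform_bound([0.05] * M, K, Kk)
fast_adjustment = max_uniform_bound([0.3] * M, K, Kk)
```

In this toy example a slower speed of adjustment tolerates older messages, which is the qualitative trade-off described above; the numbers themselves depend entirely on the assumed costs.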

We may summarize the above discussion by saying that (for the above
assumed model of behavior in cooperative organizations), the approach
of Section 5.6 may be used to design an organizational structure,
that is, prescribe the nature of communication flows. It must be
pointed out, however, that Theorem 5.6.1 defines an entire admissible
set of organizational structures. Choosing a particular element of
this set requires more study of the effect of the bounds $P^{ij}$, $Q^{ij}$ on
the convergence rate of the algorithm.

One final remark on organizational design. It is conceivable that
the structure of the optimization problem slowly changes with time,
and so do the bounds $K^k_{ij}$, although on a time scale slower than the
time scale of the adjustment process. In such a case, the bounds $P^{ij}$,
$Q^{ij}$ should also change. This leads to a natural two-level organizational
structure: At the lower level, we have a set of decision makers
continuously adjusting their decisions and exchanging messages. At a higher
level, we have a supervisor who monitors changes in $K^k_{ij}$ and accordingly
instructs the low-level decision makers to adjust their communication
rates. Note that the supervisor does not need to know the details of
the cost function; he only needs to know the degree of coupling between
the divisions. This seems to reflect the actual structure of existing
organizations. Low-level decision makers are "experts" on the
problems directly facing them, while higher-level decision makers only
know certain structural properties of the overall problem and make
certain global decisions, e.g. setting the communication rates.

Another possibility is to let the low-level decision makers monitor
the couplings $K^k_{ij}$ and accordingly adjust the communication rates by
themselves, without waiting for instructions from above. This essentially
amounts to another decentralized algorithm (superimposed on the original
one) which aims at solving the set of inequalities (5.6.13) of
Theorem 5.6.1.


A further topic for research relevant to organization theory is
the following: given the mathematical framework we have introduced,
are there any generic properties of particular organizational structures?
For example, if we have a hierarchy (that is, the graph $G$ of Section 5.6
is a tree), does the adjustment process have any specific properties
which are structurally different from those for general graphs?

5.8  OTHER  POSSIBLE  APPLICATIONS 


Routing  in  Communication  Networks 

The problem of (quasi-static) routing of messages in a communication
network may be formulated as a nonlinear programming problem, whose
objective is to minimize some performance criterion related to the
delays in the transmission of messages through the network [Cantor and
Gerla, 1974]. There are plenty of reasons for which one would like to
have this problem solved in a decentralized way. Certain decentralized
algorithms have been proposed [Gallager, 1977; Gafni and Bertsekas, 1983]
but most often convergence is proved under an assumption of strict
synchronism, which is undesirable and unrealistic. Our approach may be
used, however, to prove convergence under more realistic assumptions,
tolerating a fair amount of asynchronism. A small modification of
our results and proofs would be needed, however, to handle problems
with inequality constraints, because such constraints are present in
communication network problems.†

† A result of this type has been recently obtained for a gradient projection
algorithm [Tsitsiklis and Bertsekas, 1984].

As a more specific application one may consider the algorithm of
Gafni and Bertsekas [1983] for routing in communication networks
utilizing virtual circuits. This algorithm admits a convenient on-line
decentralized implementation and, under certain assumptions, yields
approximately optimal routing strategies, even within the class of
dynamic strategies. We strongly conjecture that the same is true
for the asynchronous version of this algorithm.

Parallel  Computation 

Many architectures for parallel computation have been proposed
recently, ranging from general purpose multi-processors to special
purpose VLSI architectures. Together with them comes, of course, an effort to
develop good parallel algorithms that suit the available architectures.

Consider an optimization problem with the structure assumed in
Section 5.6 and suppose that we have available a multi-processor
architecture: each processor is assumed to be capable of storing one
of the additive terms of the cost function and evaluating its partial
derivatives at a given point. We also assume that the processors are
arranged in a graph which coincides with the one induced by the structure
of the cost function (see Section 5.6).

Given  this  architecture,  we  considered,  in  Section  5.6,  both 
synchronous  and  asynchronous  algorithms  for  the  same  problem,  which  we 
now  compare: 

The main disadvantages of the synchronous algorithm are the following:

(i) Synchronism. This requires a synchronization protocol which may
introduce substantial overhead and, therefore, be practically undesirable.

(ii) Bottlenecks caused by communication delays. If there is
some pair of processors $(i, j)$ such that the delay of messages transmitted
from $i$ to $j$ is excessively large (compared to the delay of other messages)
then all processors must remain idle until the message is received by
processor $j$, and only then can they proceed to the next round of
computation. Therefore, the speed of the algorithm is determined by
the largest communication delay and processors may have to remain idle
a large fraction of the time.

(iii) Excessive communication requirements. At each stage of the
algorithm, a message must be transmitted along each arc in the associated
graph. This may be problematic, especially if communications capacity is
a scarce resource, as is often the case in VLSI architectures. Moreover,
if many messages are routed using a common bus, large delays, and therefore
bottlenecks, may result.

We now indicate how all the above mentioned problems are alleviated
by the asynchronous algorithm that we introduced in Section 5.6. First,
it has no synchronization requirements. Moreover, the proof of Theorem
5.6.1 may be easily modified to handle the case in which processors
are allowed not to update once in a while (see
Corollary 5.3.1). This could capture the possibility that certain
processors perform computations faster than others, or that their
individual clocks do not run at the same speed. Second, and most
important, our asynchronous algorithm allows communication delays and
infrequent communications, without creating bottlenecks: if processor
$i$ has not received the value of some component that it needs for its
own computations, it keeps computing using a value for that component
it had received some time in the past. So, it will update slightly
in the wrong direction, but this may be substantially better than not
updating at all.

Of course, there is a certain tradeoff: if communications become
too infrequent and delays too large, while the step-size is held
constant, the convergence rate of the algorithm worsens or the algorithm
need not converge at all. It is an open research topic how to handle
this tradeoff. It is intuitively clear - at least if the Hessian
matrix of the cost function is diagonally dominant in some sense and
communication delays are not too large - that the asynchronous algorithm
will be faster than the synchronous one, but more research is needed,
including numerical experimentation.

5.9  SUMMARY  OF  PROBLEMS  FOR  FUTURE  RESEARCH 

In the preceding sections we have mentioned or alluded to certain
problems, related to the ones studied in this chapter, which are
potential topics for future research. In this section we put them
together, for easier reference, and to place them into perspective.

Since this is mainly a "classification" section, we will be very brief
and the reader should consult earlier sections for more details. The
problems outlined below fall into three general categories: a) simple
modifications of results in this chapter; b) new research directions
of a general (theoretical) nature; c) applications in specific fields.

1. Obtain convergence results for distributed algorithms for constrained
optimization, in a setting similar to that of Section 5.3.

2. Obtain convergence results for algorithms in which there are no bounds
on the time between consecutive communications, but communications are
"event-driven": a message is transmitted whenever a substantial change occurs.
Compare, theoretically or through simulation, such algorithms with the old
ones, as well as with the corresponding centralized (synchronous) algorithms.

3.  Assume  that  the  cost  function  J  to  be  minimized  is  convex  and  see  whether 
something  special  may  be  said,  in  view  of  the  fact  that  the  combining  process 
consists  of  forming  convex  combinations. 

4.  Obtain  precise  results  on  the  convergence  rate  of  decreasing  step-size 
distributed  algorithms. 

5.  Modify  Theorem  5.4.1  so  as  to  be  valid  even  if  an  associated  ODE  has  only 
a  bounded  domain  of  attraction. 

6. Prove analogs of Theorems 2 and 3 of Ljung [1977a] for distributed
algorithms.

7. Investigate theoretically and/or with simulations the convergence rate of
the gradient algorithm of Section 5.6. Also, obtain convergence conditions
tighter than those of Theorem 5.6.1 and possibly narrow the gap between
necessary and sufficient conditions for convergence.

8. Run simulations of distributed identification algorithms to evaluate their
performance, especially during the initial transient period. Also, find
more structures for distributed identification to which our results may be
applied.

9. Use results on algorithms for constrained optimization for optimal
routing and flow control in data communication networks.

10. Apply the results of Section 5.6 to problems of designing the pattern of
communications, either for the purpose of speeding up parallel computation
(in the context of Section 5.8) or for the purpose of designing a divisionalized
organization.

11. Consider two-level organizations in which the higher level sets the rate
of communications, while the lower level executes a distributed optimization
algorithm (see Section 5.7).

12. Examine whether certain organizational structures (e.g. hierarchies) have
any specific properties which reflect themselves on properties of the corresponding
distributed algorithms (see Section 5.7).


5.10  SUMMARY  AND  CONCLUSIONS 

A broad class of deterministic and stochastic iterative algorithms admit
asynchronous decentralized implementations which retain the desirable convergence
properties of their centralized (or synchronous) counterparts. The main requirements
for convergence are that the time between consecutive transmissions of messages,
as well as communication delays, are not too large when compared to the natural
time scale associated with the algorithm. For algorithms with decreasing step-size,
this allows communications to become more and more infrequent as the
algorithm progresses.

For a class of optimization problems consisting of a set of coupled subproblems,
we have shown that communication requirements depend, in a quantifiable
way, on the degree of coupling between subproblems.


Our results and, more generally, our line of approach may be useful in
several different settings, for example in parallel computation, in organizational
design, in decentralized signal processing or in routing for data
communication networks.

Some  of  the  advantages  of  asynchronous  algorithms  are:  there  are  no 
synchronization  requirements,  which  makes  implementation  easier;  bottlenecks 
to  the  speed  of  the  algorithm,  caused  by  communication  delays,  are  relieved; 
finally,  there  may  be  savings  in  the  total  number  of  exchanged  messages,  so 
that  overloading  of  communication  channels  is  avoided. 


CHAPTER  6:  A  GLOBAL  VIEW 


6.1  OVERVIEW 

The results in this report have already been reviewed and discussed at several places. Instead of an additional review, we discuss in this Section the main conceptual lines which link together the various pieces in our study.

What our results have in common is that (almost) all refer to static decision problems (estimation or identification problems being a special case). The word "static" is used here to distinguish them from those problems in which we are interested in decentralized feedback control of a dynamical system. However, even though the underlying decision problems are static, time enters in our study in two ways: a) it may be the case that the information relevant to the problem is not obtained all at once, but sequentially, so that better decisions are being evaluated as more data become available; b) even if all data become available at once, a solution to the decision problem need not be obtained instantly, but through an iterative process of adjustment.

We first considered the case in which all relevant data become available at time zero and we investigated the question whether decisions may be evaluated instantly, without any communications. If not, we exclude the possibility of centralizing information by direct transmission of all data; rather, we consider sequential schemes for transferring information and evaluating decisions (decentralized protocols). Given that optimal protocols are hard to design, we consider a few specific (ad-hoc chosen) classes of protocols and address two types of questions: a) what happens when a specific protocol and rule for updating decisions is employed, and b) what are the communication requirements of such a decentralized scheme in order to guarantee smooth and desirable operation (e.g. convergence).

The class of protocols that we have studied could not be exhaustive. Specific real-world applications might require drastically different ones. However, we have focussed on fairly general and universally applicable ones, as witnessed by the wide range of possible applications we have suggested in earlier Chapters.

6.2  AN  ORGANIZATIONAL  VIEW 

Recall that in Section 2.2 we offered an abstract and schematic picture of the operation and evolution of real world organizations. We suggested there that such a picture could guide the selection of research topics. We now indicate the correspondence between some of the issues raised in Section 2.2 and our results.

In small and simple organizations, many decisions are being made without communicating. This corresponds to the problems studied in Chapter 3. We have seen, however, that silent coordination leads to hard problems. Consequently, apart from small and simple decision problems, communications become necessary, even if redundant. A common situation in which a set of decision makers have to communicate is when they have to agree on something. Chapter 4 effectively addresses such situations.

As the organization grows, it ceases being optimal and some decision makers start changing their mode of operation so as to improve performance. As their operation changes, they need to (and do) inform those decision makers who should be concerned about such changes. This corresponds to a decentralized adjustment process, of the type studied in Chapter 5. Finally, as the organization becomes even more complex, higher levels of decision making are introduced who know who should be concerned about what and set up the necessary information flows. Such two-level schemes remain to be studied in the future.

6.3 GENERAL AREAS FOR FUTURE RESEARCH

Several problems which are closely related to our study have already been suggested in the main body of this report and will not be repeated here. We will suggest, however, some broader directions which seem to be of particular interest.

From a mathematical perspective, the problem "who should communicate to whom, what, etc." may be viewed as a problem of decentralized complexity theory. A lot of progress in this direction is being made, mainly in the computer science literature. Systems theorists, however, are more often concerned with continuous rather than combinatorial problems. This leads to the question whether a decentralized continuous complexity theory (possibly along the lines of Nemirovsky and Yudin [1983], Traub and Wozniakowsky [1980]) is feasible. Some interesting issues concern the convergence rate of decentralized algorithms and the tradeoff with the number of communications. Given the possibility of coding an arbitrary amount of information in messages containing real numbers, this line of research will probably encounter some non-trivial issues of modelling of decentralized computation.

A second area of inquiry relates to organizational issues: hierarchical algorithms, whereby the higher levels dictate the information flows, or event-driven algorithms, in which communication protocols are not fixed in advance, but depend on the state of the environment.

A  third  area  of  inquiry  could  concern  the  nature  of  the  (implicit)  information 
exchange,  when  a  set  of  decision  makers  observe  the  actions  of  other  decision  makers, 
with  different  information.  This  would  be  an  extension  of  the  results  in  Chapter  4  to 
situations  in  which  the  objective  of  the  decision  makers  is  not  necessarily  to  reach 
consensus.  Links  with  game  theory  may  be  investigated,  especially  when  we  have  decision 
makers  possessing  different  models  of  the  situation. 


APPENDIX  A 


Proof of Lemma 5.2.1: We only need to prove the Lemma for each component separately, since (5.2.2) corresponds to a decoupled set of linear systems. We may therefore assume that there is only one component; so, the subscript $l$ will be omitted and $\Phi^{ij}(n|k)$ becomes a scalar.

The coefficient $\Phi^{ij}(n|k)$, being the impulse response of the linear system (5.2.2), is determined by the following "experiment": let us fix a processor $j$ and some time $k$; let $x^h(1)=0$, $\forall h$, and $s^h(m)=0$, $\forall h,m$, unless $h=j$ and $m=k$, in which case we let $\gamma^j(k)s^j(k)=v$, some nonzero element of $H$. For all times $n$ and for all processors $i$, $x^i(n)$ will be a scalar multiple of $v$. This proportionality factor is precisely equal to $\Phi^{ij}(n|k)$. Since this experiment takes place in a one-dimensional subspace of $H$, we may assume, without loss of generality, that $H$ is one-dimensional and that $v=1$. We then have, for the above "experiment", $\Phi^{ij}(n|k)=x^i(n)$, $\forall i,n$. A trivial induction based on (5.2.2) and using (5.2.3) shows that $x^i(n)\ge 0$, $\forall i,n$, which proves (5.2.8).

For inequality (5.2.9) we consider a different "experiment": let us fix some time $k$ and let $x^h(1)=0$, $\forall h$, and $s^h(m)=0$, $\forall m,h$, unless $m=k$, in which case we let $\gamma^h(k)s^h(k)=1$, $\forall h$. ($H$ is still assumed one-dimensional.) We then use (5.2.2) and (5.2.4) to show by a simple inductive argument that $x^i(n)\le 1$, $\forall i,n$, which concludes the proof of part (i).
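
The two "experiments" above are easy to reproduce numerically. The following minimal sketch is only an illustration under stated assumptions: it takes the zero-delay, scalar form $x(n+1)=A(n)x(n)+u(n)$ of the recursion (the actual system (5.2.2) allows delays), uses randomly generated row-stochastic weight matrices, and the helper names `random_stochastic_matrix` and `impulse_response` are hypothetical.

```python
import random


def random_stochastic_matrix(size, rng):
    """Nonnegative matrix whose rows each sum to 1 (illustrative weights)."""
    rows = []
    for _ in range(size):
        w = [rng.random() for _ in range(size)]
        total = sum(w)
        rows.append([v / total for v in w])
    return rows


def impulse_response(A_seq, j, k, n):
    """Return the list [Phi^{ij}(n|k) for all i] by running the experiment.

    Assumed recursion (zero delays): x(m+1) = A_seq[m] x(m) + u(m),
    with x(1) = 0 and u(m) = 0 except for a unit input to processor j
    at time k, so that x^i(n) equals Phi^{ij}(n|k).
    """
    M = len(A_seq[1])
    x = [0.0] * M                                   # x(1) = 0
    for m in range(1, n):
        nxt = [sum(A_seq[m][i][h] * x[h] for h in range(M)) for i in range(M)]
        if m == k:
            nxt[j] += 1.0                           # gamma^j(k) s^j(k) = 1
        x = nxt
    return x                                        # this is x(n)


rng = random.Random(1)
M, horizon = 3, 12
A_seq = [None] + [random_stochastic_matrix(M, rng) for _ in range(horizon)]
k, n = 4, 10
# Phi[j][i] holds Phi^{ij}(n|k), obtained from one experiment per processor j.
Phi = [impulse_response(A_seq, j, k, n) for j in range(M)]
# By linearity, summing over j reproduces the second experiment (unit input to all).
row_sums = [sum(Phi[j][i] for j in range(M)) for i in range(M)]
print(min(min(col) for col in Phi) >= 0.0, row_sums)
```

In this zero-delay setting the sums over $j$ come out equal to 1 up to rounding, which is consistent with the bound obtained from the second experiment.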

For the proof of the remaining parts of the Lemma we first perform a reduction to a simpler case. It is relatively easy to see that we may assume, without loss of generality, that communication delays are nonzero. For example, we could redefine the time variable so that one time unit for the old time variable corresponds to two time units for the new one and so that any message that had zero delay for the original description has unit delay for the new description. If any of the Assumptions 5.2.1, 5.2.2, 5.2.3, 5.2.4 holds in terms of the old time variable, it also remains true with the new one.

Next, we perform a reduction to the case where all messages have zero delay, as follows: for any $(i,j)\in E$, we introduce a finite set of dummy processors between $i$ and $j$ (see Fig. A.1) which act as buffers. Any message from $i$ to $j$ is first transmitted to a buffer processor (with zero delay), which holds it for time equal to the desired delay and then transmits it to $j$, again with zero delay. (So, whenever a buffer processor $h$ receives a message, it lets $a^{hi}(n)=1$.) The buffer processor which is to be employed for any particular message is determined by a round-robin rule. Note that for each pair $(i,j)\in E$ the number of buffer processors that needs to be introduced is equal to the maximum communication delay for messages from $i$ to $j$, which has been assumed finite.* Note also that the buffer processors are non-computing ones and have in-degree equal to 1.
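
The buffer construction can be sketched as follows. This is an illustration only, under assumptions that are ours rather than the report's: time is integer valued, at most one message is sent on a given link per time unit, every delay lies between 1 and `max_delay`, and the function name `route_through_buffers` is hypothetical. Under these assumptions a round-robin choice among `max_delay` buffers never hands a message to a buffer that is still holding an earlier one.

```python
def route_through_buffers(messages, max_delay):
    """Deliver messages on one link (i, j) through round-robin buffers.

    messages: list of (send_time, delay, payload), sorted by send_time,
              with distinct integer send times and 1 <= delay <= max_delay.
    Returns a list of (arrival_time, payload, buffer_index).
    """
    busy_until = [0] * max_delay          # time at which each buffer is free again
    deliveries = []
    for count, (send_time, delay, payload) in enumerate(messages):
        b = count % max_delay             # round-robin choice of buffer
        if busy_until[b] > send_time:
            raise ValueError("buffer %d is still holding a message" % b)
        busy_until[b] = send_time + delay # the buffer holds the message for `delay`
        deliveries.append((send_time + delay, payload, b))
    return deliveries


# Ten messages sent at times 1..10 with delays cycling through 2, 3, 1.
msgs = [(t, 1 + (t % 3), "m%d" % t) for t in range(1, 11)]
out = route_through_buffers(msgs, max_delay=3)
for arrival, payload, b in out:
    print(arrival, payload, b)
```

The reason no buffer is ever overloaded in this sketch is that two messages assigned to the same buffer are sent at least `max_delay` time units apart, while no message is held for longer than `max_delay`.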

It is easy to see that if Assumptions 5.2.2, 5.2.3 are valid for the original description of the algorithm, they are also valid for the above introduced augmented description, possibly with a different choice of constants. Assumption 5.2.1 also remains valid, except for part c(iii), which may be violated. (If in Figure A.1 processor $j$ had originally in-degree equal to 1, it now has in-degree equal to 4, but there is nothing in our assumptions that guarantees that $a^{jj}(n)\ge\alpha$, $\forall n$.) We have, nevertheless, the following condition:

*  This  procedure  is  equivalent  to  a  state  augmentation  for  the  linear  system 
(5.2.2),  as  discussed  in  Section  5.2.1. 

Assumption A.1: For any processor $i$ with in-degree larger than 1, either

1. Assumption 5.2.1c(iii) holds, or

2. All predecessors of $i$ are non-computing, have in-degree 1 and a common predecessor.

From now on, we assume, without loss of generality, that communication delays are zero, provided that we replace Assumption 5.2.1c(iii) by Assumption A.1.

Let $A(n)$, $\Phi(n|k)$ be the matrices with coefficients $a^{ij}(n)$, $\Phi^{ij}(n|k)$, respectively. Because of the zero delay assumption it is easy to see that*

$$\Phi(n|k)=\prod_{m=k+1}^{n-1}A(m),\qquad n\ge k+1.$$

We first prove the desired results under Assumptions 5.2.1 and 5.2.2. By combining Assumptions 5.2.1c(i) and 5.2.2, note that there exists a constant $B$ such that, for any $(i,j)\in E$ and any time interval $I$ of length $B$, there exists some $t\in I$ such that $a^{ji}(t)\ge\alpha>0$. (These are times that $i$ communicates to $j$.)

Let us fix a computing processor $j$ and some time $k$. By relabelling, let us assume that $j=1$. We will show by induction (with respect to a particular numbering of the processors) that for any processor $i$, there exists some $\alpha_i>0$, independent of $k$, such that $\Phi^{i1}(n|k)\ge\alpha_i$, for all $n\in[k+(i-1)B+1,\,k+MB]\triangleq I_i$.

To start the induction, consider first the case $i=1$ and notice that

$$\Phi^{11}(n|k)\ \ge\ \prod_{t=k+1}^{n-1}a^{11}(t)\ \ge\ \alpha^{n-k-1}\ \ge\ \alpha^{MB}\triangleq\alpha_1>0,\qquad\forall n\in I_1.$$

*Since each $A(n)$ is a "stochastic" matrix (nonnegative entries, each row sums to 1), questions of convergence of $\Phi(n|k)$ are equivalent to questions about the long-run behavior of a finite (time-varying) Markov chain.


Suppose now that a subset $S$ of the processors has been numbered $1,\ldots,i-1$, for some $i\ge 2$, and that the induction hypothesis has been proved for all processors in $S$. We show that it is always possible to find a new processor in $V\setminus S$, rename it to $i$ and prove the induction hypothesis for $i$ as well.

Consider the set $Q$ of processors $q\notin S$ such that $(p,q)\in E$, for some $p\in S$. If $Q=\emptyset$, then $S$ is the set of all processors (because of Assumption 5.2.1b) and we are done. If not, we choose one processor from $Q$ and rename it to $i$, subject to the following restriction: we choose a processor with in-degree more than one only if no processor with in-degree equal to one belongs to $Q$.
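
The numbering procedure used in this induction amounts to a constrained graph traversal, which can be sketched as follows. The function name `number_processors`, the representation of $E$ as a set of pairs, and the tie-breaking by label are illustrative assumptions; the only rule taken from the argument above is that a processor with in-degree larger than one is selected from $Q$ only when $Q$ contains no processor with in-degree one.

```python
def number_processors(edges, start):
    """Order processors as in the inductive argument (illustrative sketch).

    edges: set of directed pairs (p, q), meaning p sends messages to q;
           self-loops are assumed absent.
    start: the computing processor that receives number 1.
    Returns the processors in the order in which they are numbered.
    """
    nodes = {start} | {p for p, _ in edges} | {q for _, q in edges}
    in_degree = {v: sum(1 for (_, q) in edges if q == v) for v in nodes}
    order = [start]
    numbered = {start}                                 # the set S
    while True:
        # Q: processors outside S with at least one predecessor in S.
        frontier = {q for (p, q) in edges if p in numbered and q not in numbered}
        if not frontier:
            return order                               # Q is empty: done
        # Prefer in-degree 1; ties are broken by label (an arbitrary choice).
        chosen = min(frontier, key=lambda v: (in_degree[v] != 1, v))
        order.append(chosen)
        numbered.add(chosen)


example_edges = {(1, 2), (1, 3), (2, 4), (3, 4), (4, 1)}
example_order = number_processors(example_edges, start=1)
print(example_order)
```

In the example, processor 4 has in-degree two and is therefore numbered only after processors 2 and 3, both of which have in-degree one.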

We now prove the induction hypothesis for processor $i$. Let $h\in S$ be some predecessor of $i$. Then $h<i$ and, by the induction hypothesis, $\Phi^{h1}(n|k)\ge\alpha_h>0$, $\forall n\in I_h\supset I_i$. Moreover, for some $t\in[k+(i-2)B+1,\ldots,k+(i-1)B]$ we have $a^{ih}(t)\ge\alpha$ and consequently, $\Phi^{i1}(t+1|k)\ge\alpha\alpha_h$.

We first suppose that $i$ has in-degree 1. We prove by induction on $n$, for $n\in[t+1,\ldots,k+MB]\supset I_i$, that $\Phi^{i1}(n|k)\ge\alpha\alpha_h\triangleq\alpha_i>0$. Indeed,

$$\Phi^{i1}(n+1|k)=a^{ih}(n)\Phi^{h1}(n|k)+a^{ii}(n)\Phi^{i1}(n|k)\ \ge\ \min\{\Phi^{h1}(n|k),\Phi^{i1}(n|k)\}\ \ge\ \min\{\alpha_h,\alpha\alpha_h\}=\alpha\alpha_h.$$

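Suppose now that $i$ has in-degree more than 1 and that Assumption 5.2.1c(iii) holds. Then,

$$\Phi^{i1}(n|k)\ \ge\ \left[\prod_{m=t+1}^{n-1}a^{ii}(m)\right]\Phi^{i1}(t+1|k)\ \ge\ \alpha^{n-t}\alpha_h\ \ge\ \alpha^{MB}\alpha_h\triangleq\alpha_i>0,\qquad\forall n\in I_i.$$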

The last possibility is that $i$ has in-degree more than 1 (hence all processors in $Q$ have in-degree more than 1) and Assumption 5.2.1c(iii) fails. Then the set of predecessors of $i$ (denoted by $U$) has a single common predecessor, denoted by $j$. Since $i\in Q$, some $h\in U$ must belong to $S$. Since any $h\in U$ is non-computing, we have $h\ne 1$ and its predecessor $j$ must belong to $S$. Now, for any $p\in U$, $p$ does not belong to $Q$ (since it has in-degree 1) and therefore, $p\in S$. We conclude that all predecessors of $i$ belong to $S$ ($U\subset S$). We now perform an easy induction on $n$, for $n\in[t+1,\ldots,k+MB]$, to show that $\Phi^{i1}(n|k)\ge\alpha\min_{h\in U}\{\alpha_h\}\triangleq\alpha_i>0$. Indeed,

$$\Phi^{i1}(n+1|k)=\sum_{h\in U\cup\{i\}}a^{ih}(n)\Phi^{h1}(n|k)\ \ge\ \min_{h\in U\cup\{i\}}\Phi^{h1}(n|k)\ \ge\ \min\Big\{\alpha_i,\ \min_{h\in U}\{\alpha_h\}\Big\}=\alpha_i.$$

This completes our inductive argument.

This  completes  our  inductive  argument. 

We may now conclude that $\Phi(k+MB|k)$ is a stochastic matrix with the property that all entries in some column (corresponding to any computing processor) are positive and bounded away from zero by a constant $\alpha>0$ which does not depend on $k$. We combine this fact with Lemma A.1 below to conclude that the assertions of Lemma 5.2.1 are true.

Lemma A.1: Consider a sequence $\{D_n\}$ of nonnegative matrices with the properties that:

(i) Each row sums to 1.

(ii) For some $\alpha>0$ and for some column (say the first one), all entries of $D_n$, for any $n$, in that column are larger than or equal to $\alpha$.

Then:

a) $D=\lim_{n\to\infty}\prod_{k=1}^{n}D_k$ exists.

b) All rows of $D$ are identical.

c) The entry in the first column of $D$ is bounded below by $\alpha$.

d) Convergence to $D$ takes place at the rate of a geometric progression.

Proof of Lemma A.1: Given any vector $x=(x_1,\ldots,x_M)$ we decompose it as $x=y+ce$, where $c$ is a scalar, $e$ is the vector with all entries equal to 1 and $y$ has one zero entry and all other entries are nonnegative. (So $c$ equals the minimum of the components of $x$.)

Let $x(n)=\prod_{k=1}^{n}D_k\,x(0)$, $\forall n$, and $x(n)=y(n)+c(n)e$. It is easy to see that $\|y(n+1)\|_\infty\le(1-\alpha)\|y(n)\|_\infty$, where $\|\cdot\|_\infty$ denotes the max-norm on $R^M$, which shows that $y(n)$ converges geometrically to zero. Moreover, $c(n)\le c(n+1)\le c(n)+\|y(n)\|_\infty$, which shows that $c(n)$ also converges geometrically to some $c$. Hence, $x(n)$ converges geometrically to $ce$. Since this is true for any $x(0)\in R^M$, parts (a), (b), (d) of the Lemma follow. Part (c) is proved by an easy induction for the finite products of the $D_k$'s and, therefore, it holds for $D$ as well. □
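
The contraction at the heart of this proof can be checked numerically. The sketch below is only an illustration: the random matrices, the value of $\alpha$, the dimension and the helper names `random_matrix`, `mat_vec` and `spread` are assumptions. It uses the fact that, in the decomposition $x=y+ce$, the max-norm of $y$ equals the difference between the largest and the smallest component of $x$.

```python
import random


def random_matrix(size, alpha, rng):
    """Row-stochastic matrix whose first-column entries are all >= alpha."""
    rows = []
    for _ in range(size):
        w = [rng.random() for _ in range(size)]
        total = sum(w)
        row = [(1.0 - alpha) * v / total for v in w]   # this part sums to 1 - alpha
        row[0] += alpha                                 # first entry is now >= alpha
        rows.append(row)
    return rows


def mat_vec(D, x):
    """Matrix-vector product D x."""
    return [sum(d * v for d, v in zip(row, x)) for row in D]


def spread(x):
    """max(x) - min(x), i.e. the max-norm of y in the decomposition x = y + c e."""
    return max(x) - min(x)


rng = random.Random(0)
alpha, size = 0.2, 4
x = [rng.uniform(-1.0, 1.0) for _ in range(size)]
spreads, minima = [spread(x)], [min(x)]
for _ in range(30):
    x = mat_vec(random_matrix(size, alpha, rng), x)    # x(n+1) = D_{n+1} x(n)
    spreads.append(spread(x))
    minima.append(min(x))

print(spreads[0], spreads[-1])
```

The recorded spreads shrink by at least the factor $1-\alpha$ at every step, while the minima $c(n)$ never decrease, in agreement with the two inequalities used in the proof.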

We now consider the case where Assumption 5.2.2 is replaced by 5.2.3. The key observation here is that during an interval of the form $[B_1 n^{\beta},\,B_1(n+1)^{\beta}]$ a bounded number of messages is transmitted; hence, $A(k)=I$, except for a bounded number of times in that interval. If we redefine the time variable so that time is incremented only at communication times, we have reduced the problem to the case of Assumption 5.2.2. The only difference, due to the change of the time variable,
is in the rate of convergence. Under Assumption 5.2.2, $\|\Phi(n|k)-\Phi(k)\|$ decreases by a constant factor during intervals of constant length. This implies that, under Assumption 5.2.3, $\|\Phi(n|k)-\Phi(k)\|$ decreases by a constant factor during intervals of the form $[B_1 t^{\beta},\,B_1(t+c)^{\beta}]$, for some appropriate constant $c$. Therefore, if $B_1 t^{\beta}=k$ and $B_1(t+m)^{\beta}=n$, we have $\|\Phi(n|k)-\Phi(k)\|\le B d_1^{m}$, for some $B>0$, $d_1\in[0,1)$.

Eliminating $t$ and solving for $m$, we obtain

$$m=\left(\frac{n}{B_1}\right)^{1/\beta}-\left(\frac{k}{B_1}\right)^{1/\beta},$$

which finally yields

$$\|\Phi(n|k)-\Phi(k)\|\le B\,d_1^{(1/B_1)^{1/\beta}\,(n^{1/\beta}-k^{1/\beta})}.$$

Let $\delta=1/\beta$ and $d=d_1^{(1/B_1)^{1/\beta}}$, to recover the desired result.